Skip to content

Commit 225f76f

Browse files
authored
[ENH] Wireup regex in filter operator (#4452)
## Description of changes _Summarize the changes made by this PR._ - Improvements & Bug fixes - Return prefix as part of `get_range` method - New functionality - Wire up regex evaluation process in filter operator ## Test plan _How are these changes tested?_ - [ ] Tests pass locally with `pytest` for python, `yarn test` for js, `cargo test` for rust ## Documentation Changes _Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the [docs section](https://github.com/chroma-core/chroma/tree/main/docs/docs.trychroma.com)?_ <!-- Summary by @propel-code-bot --> --- **Implementing Regex Matching in Filter Operator** This PR implements regex evaluation within the filter operator, enabling document-level regex matching. It changes blockfile reader APIs to expose prefixes as part of query results and leverages full-text indexes for more efficient regex evaluation. The implementation includes optimizations for exact pattern matching that avoids re-scanning documents when possible. **Key Changes:** • Add regex pattern evaluation in filter operator using `ChromaRegex` from `chroma_types` • Update blockfile reader `APIs` to return prefixes along with keys and values • Implement `NgramLiteralProvider` for `FullTextIndexReader` to optimize regex searches • Add benchmark for regex matching with various patterns • Return prefix as part of `get_range` and `get_range_stream` methods in blockstore **Affected Areas:** • Worker filter operator implementation • Blockstore reader/writer ``API`` • `FullText` index implementation • Distributed segment implementation • Metadata segment implementation **Potential Impact:** **Functionality**: Adds regex matching capability for document filtering, providing users with more powerful query options **Performance**: Optimizes regex matching by using full-text indexes for preliminary candidate selection before regex validation, improving search performance for complex patterns **Security**: No significant security implications **Scalability**: Efficient regex implementation should maintain good performance as document collections grow **Review Focus:** • Regex implementation efficiency in filter.rs • Error handling for regex pattern compilation and matching • ``API`` changes to return prefix in blockstore methods • Benchmark methodology for regex performance evaluation <details> <summary><strong>Testing Needed</strong></summary> • Benchmark different regex patterns against large document collections • Test complex regex patterns with edge cases • Verify consistent results between optimized and brute-force implementations • Test with documents of varying sizes and content </details> <details> <summary><strong>Code Quality Assessment</strong></summary> **rust/worker/src/execution/operators/filter.rs**: Well-structured implementation with good error handling, but has a potential performance bottleneck with sequential document fetching **rust/blockstore/src/memory/reader_writer.rs**: Multiple unwrap() calls could be replaced with expect() for better error messages **rust/index/src/fulltext/types.rs**: Possible lifetime parameter issues in the lookup_ngram_range method </details> <details> <summary><strong>Best Practices</strong></summary> **Performance**: • Consider parallelizing document fetches with `buffer_unordered` for concurrent I/O **Error Handling**: • Replace unwrap() with expect() for better error messages • Consider proper error propagation instead of unwrapping </details> <details> <summary><strong>Possible Issues</strong></summary> • Potential panics from unwrap() calls in several places • Sequential document fetching in filter operator might cause performance issues with large result sets • Lifetime parameter conflicts in lookup_ngram_range method • Fallback bitmap handling in assertions could cause test failures </details> --- *This summary was automatically generated by @propel-code-bot*
1 parent 0c1fc3b commit 225f76f

File tree

14 files changed

+358
-104
lines changed

14 files changed

+358
-104
lines changed

rust/blockstore/src/arrow/block/types.rs

Lines changed: 9 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -361,7 +361,7 @@ impl Block {
361361
&'me self,
362362
prefix_range: PrefixRange,
363363
key_range: KeyRange,
364-
) -> impl Iterator<Item = (K, V)> + 'me
364+
) -> impl Iterator<Item = (&'me str, K, V)> + 'me
365365
where
366366
PrefixRange: RangeBounds<&'prefix str>,
367367
KeyRange: RangeBounds<K>,
@@ -410,8 +410,16 @@ impl Block {
410410
Bound::Unbounded => self.len(),
411411
};
412412

413+
let prefix_array = self
414+
.data
415+
.column(0)
416+
.as_any()
417+
.downcast_ref::<StringArray>()
418+
.unwrap();
419+
413420
(start_index..end_index).map(move |index| {
414421
(
422+
prefix_array.value(index),
415423
K::get(self.data.column(1), index),
416424
V::get(self.data.column(2), index),
417425
)

rust/blockstore/src/arrow/blockfile.rs

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -496,7 +496,7 @@ impl<'me, K: ArrowReadableKey<'me> + Into<KeyWrapper>, V: ArrowReadableValue<'me
496496
&'me self,
497497
prefix_range: PrefixRange,
498498
key_range: KeyRange,
499-
) -> impl Stream<Item = Result<(K, V), Box<dyn ChromaError>>> + Send + 'me
499+
) -> impl Stream<Item = Result<(&'me str, K, V), Box<dyn ChromaError>>> + Send + 'me
500500
where
501501
PrefixRange: RangeBounds<&'prefix str> + Clone + Send + 'me,
502502
KeyRange: RangeBounds<K> + Clone + Send + 'me,
@@ -533,7 +533,7 @@ impl<'me, K: ArrowReadableKey<'me> + Into<KeyWrapper>, V: ArrowReadableValue<'me
533533
&'me self,
534534
prefix_range: PrefixRange,
535535
key_range: KeyRange,
536-
) -> Result<Vec<(K, V)>, Box<dyn ChromaError>>
536+
) -> Result<Vec<(&'me str, K, V)>, Box<dyn ChromaError>>
537537
where
538538
PrefixRange: RangeBounds<&'prefix str> + Clone,
539539
KeyRange: RangeBounds<K> + Clone,
@@ -543,7 +543,7 @@ impl<'me, K: ArrowReadableKey<'me> + Into<KeyWrapper>, V: ArrowReadableValue<'me
543543
.sparse_index
544544
.get_block_ids_range(prefix_range.clone(), key_range.clone());
545545

546-
let mut result: Vec<(K, V)> = vec![];
546+
let mut result: Vec<(&str, K, V)> = vec![];
547547
for block_id in block_ids {
548548
let block_opt = match self.get_block(block_id, StorageRequestPriority::P0).await {
549549
Ok(Some(block)) => Some(block),
@@ -894,7 +894,7 @@ mod tests {
894894
Ok(c) => {
895895
let mut kv_map = HashMap::new();
896896
for entry in c {
897-
kv_map.insert(format!("{}/{}", prefix_query, entry.0), entry.1);
897+
kv_map.insert(format!("{}/{}", prefix_query, entry.1), entry.2);
898898
}
899899
for j in 1..=5 {
900900
let prefix = format!("{}/{}", "prefix", j);
@@ -1016,7 +1016,7 @@ mod tests {
10161016

10171017
let mut kv_map = HashMap::new();
10181018
for entry in materialized_range {
1019-
kv_map.insert(entry.0, entry.1);
1019+
kv_map.insert(entry.1, entry.2);
10201020
}
10211021
for i in 1..num_keys {
10221022
let key = format!("{}/{}", "key", i);

rust/blockstore/src/memory/reader_writer.rs

Lines changed: 44 additions & 44 deletions
Original file line numberDiff line numberDiff line change
@@ -101,7 +101,7 @@ impl<
101101
&'storage self,
102102
prefix_range: PrefixRange,
103103
key_range: KeyRange,
104-
) -> Result<impl Iterator<Item = (K, V)> + 'storage, Box<dyn ChromaError>>
104+
) -> Result<impl Iterator<Item = (&'storage str, K, V)> + 'storage, Box<dyn ChromaError>>
105105
where
106106
PrefixRange: RangeBounds<&'prefix str>,
107107
KeyRange: RangeBounds<K>,
@@ -120,7 +120,7 @@ impl<
120120

121121
Ok(values
122122
.into_iter()
123-
.map(|(key, value)| (K::try_from(&key.key).unwrap(), value)))
123+
.map(|(key, value)| (key.prefix.as_str(), K::try_from(&key.key).unwrap(), value)))
124124
}
125125

126126
pub(crate) fn count(&self) -> Result<usize, Box<dyn ChromaError>> {
@@ -308,10 +308,10 @@ mod tests {
308308
assert_eq!(values.len(), 2);
309309
assert!(values
310310
.iter()
311-
.any(|(key, value)| *key == "key1" && *value == "value1"));
311+
.any(|(_, key, value)| *key == "key1" && *value == "value1"));
312312
assert!(values
313313
.iter()
314-
.any(|(key, value)| *key == "key2" && *value == "value2"));
314+
.any(|(_, key, value)| *key == "key2" && *value == "value2"));
315315
}
316316

317317
#[test]
@@ -348,13 +348,13 @@ mod tests {
348348
assert_eq!(values.len(), 3);
349349
assert!(values
350350
.iter()
351-
.any(|(key, value)| *key == 1 && *value == "value1"));
351+
.any(|(_, key, value)| *key == 1 && *value == "value1"));
352352
assert!(values
353353
.iter()
354-
.any(|(key, value)| *key == 2 && *value == "value2"));
354+
.any(|(_, key, value)| *key == 2 && *value == "value2"));
355355
assert!(values
356356
.iter()
357-
.any(|(key, value)| *key == 3 && *value == "value3"));
357+
.any(|(_, key, value)| *key == 3 && *value == "value3"));
358358
}
359359

360360
#[test]
@@ -375,10 +375,10 @@ mod tests {
375375
assert_eq!(values.len(), 2);
376376
assert!(values
377377
.iter()
378-
.any(|(key, value)| *key == 2 && *value == "value2"));
378+
.any(|(_, key, value)| *key == 2 && *value == "value2"));
379379
assert!(values
380380
.iter()
381-
.any(|(key, value)| *key == 3 && *value == "value3"));
381+
.any(|(_, key, value)| *key == 3 && *value == "value3"));
382382
}
383383

384384
#[test]
@@ -420,13 +420,13 @@ mod tests {
420420
assert_eq!(values.len(), 3);
421421
assert!(values
422422
.iter()
423-
.any(|(key, value)| *key == 1.0 && *value == "value1"));
423+
.any(|(_, key, value)| *key == 1.0 && *value == "value1"));
424424
assert!(values
425425
.iter()
426-
.any(|(key, value)| *key == 2.0 && *value == "value2"));
426+
.any(|(_, key, value)| *key == 2.0 && *value == "value2"));
427427
assert!(values
428428
.iter()
429-
.any(|(key, value)| *key == 3.0 && *value == "value3"));
429+
.any(|(_, key, value)| *key == 3.0 && *value == "value3"));
430430
}
431431

432432
#[test]
@@ -450,10 +450,10 @@ mod tests {
450450
assert_eq!(values.len(), 2);
451451
assert!(values
452452
.iter()
453-
.any(|(key, value)| *key == 2.0 && *value == "value2"));
453+
.any(|(_, key, value)| *key == 2.0 && *value == "value2"));
454454
assert!(values
455455
.iter()
456-
.any(|(key, value)| *key == 3.0 && *value == "value3"));
456+
.any(|(_, key, value)| *key == 3.0 && *value == "value3"));
457457
}
458458

459459
#[test]
@@ -487,13 +487,13 @@ mod tests {
487487
assert_eq!(values.len(), 3);
488488
assert!(values
489489
.iter()
490-
.any(|(key, value)| *key == 1 && *value == "value1"));
490+
.any(|(_, key, value)| *key == 1 && *value == "value1"));
491491
assert!(values
492492
.iter()
493-
.any(|(key, value)| *key == 2 && *value == "value2"));
493+
.any(|(_, key, value)| *key == 2 && *value == "value2"));
494494
assert!(values
495495
.iter()
496-
.any(|(key, value)| *key == 3 && *value == "value3"));
496+
.any(|(_, key, value)| *key == 3 && *value == "value3"));
497497
}
498498

499499
#[test]
@@ -512,10 +512,10 @@ mod tests {
512512
assert_eq!(values.len(), 2);
513513
assert!(values
514514
.iter()
515-
.any(|(key, value)| *key == 2 && *value == "value2"));
515+
.any(|(_, key, value)| *key == 2 && *value == "value2"));
516516
assert!(values
517517
.iter()
518-
.any(|(key, value)| *key == 3 && *value == "value3"));
518+
.any(|(_, key, value)| *key == 3 && *value == "value3"));
519519
}
520520

521521
#[test]
@@ -549,13 +549,13 @@ mod tests {
549549
assert_eq!(values.len(), 3);
550550
assert!(values
551551
.iter()
552-
.any(|(key, value)| *key == 1.0 && *value == "value1"));
552+
.any(|(_, key, value)| *key == 1.0 && *value == "value1"));
553553
assert!(values
554554
.iter()
555-
.any(|(key, value)| *key == 2.0 && *value == "value2"));
555+
.any(|(_, key, value)| *key == 2.0 && *value == "value2"));
556556
assert!(values
557557
.iter()
558-
.any(|(key, value)| *key == 3.0 && *value == "value3"));
558+
.any(|(_, key, value)| *key == 3.0 && *value == "value3"));
559559
}
560560

561561
#[test]
@@ -574,10 +574,10 @@ mod tests {
574574
assert_eq!(values.len(), 2);
575575
assert!(values
576576
.iter()
577-
.any(|(key, value)| *key == 2.0 && *value == "value2"));
577+
.any(|(_, key, value)| *key == 2.0 && *value == "value2"));
578578
assert!(values
579579
.iter()
580-
.any(|(key, value)| *key == 3.0 && *value == "value3"));
580+
.any(|(_, key, value)| *key == 3.0 && *value == "value3"));
581581
}
582582

583583
#[test]
@@ -611,13 +611,13 @@ mod tests {
611611
assert_eq!(values.len(), 3);
612612
assert!(values
613613
.iter()
614-
.any(|(key, value)| *key == 1 && *value == "value1"));
614+
.any(|(_, key, value)| *key == 1 && *value == "value1"));
615615
assert!(values
616616
.iter()
617-
.any(|(key, value)| *key == 2 && *value == "value2"));
617+
.any(|(_, key, value)| *key == 2 && *value == "value2"));
618618
assert!(values
619619
.iter()
620-
.any(|(key, value)| *key == 3 && *value == "value3"));
620+
.any(|(_, key, value)| *key == 3 && *value == "value3"));
621621
}
622622

623623
#[test]
@@ -636,10 +636,10 @@ mod tests {
636636
assert_eq!(values.len(), 2);
637637
assert!(values
638638
.iter()
639-
.any(|(key, value)| *key == 1 && *value == "value1"));
639+
.any(|(_, key, value)| *key == 1 && *value == "value1"));
640640
assert!(values
641641
.iter()
642-
.any(|(key, value)| *key == 2 && *value == "value2"));
642+
.any(|(_, key, value)| *key == 2 && *value == "value2"));
643643
}
644644

645645
#[test]
@@ -673,13 +673,13 @@ mod tests {
673673
assert_eq!(values.len(), 3);
674674
assert!(values
675675
.iter()
676-
.any(|(key, value)| *key == 1.0 && *value == "value1"));
676+
.any(|(_, key, value)| *key == 1.0 && *value == "value1"));
677677
assert!(values
678678
.iter()
679-
.any(|(key, value)| *key == 2.0 && *value == "value2"));
679+
.any(|(_, key, value)| *key == 2.0 && *value == "value2"));
680680
assert!(values
681681
.iter()
682-
.any(|(key, value)| *key == 3.0 && *value == "value3"));
682+
.any(|(_, key, value)| *key == 3.0 && *value == "value3"));
683683
}
684684

685685
#[test]
@@ -698,10 +698,10 @@ mod tests {
698698
assert_eq!(values.len(), 2);
699699
assert!(values
700700
.iter()
701-
.any(|(key, value)| *key == 1.0 && *value == "value1"));
701+
.any(|(_, key, value)| *key == 1.0 && *value == "value1"));
702702
assert!(values
703703
.iter()
704-
.any(|(key, value)| *key == 2.0 && *value == "value2"));
704+
.any(|(_, key, value)| *key == 2.0 && *value == "value2"));
705705
}
706706

707707
#[test]
@@ -735,13 +735,13 @@ mod tests {
735735
assert_eq!(values.len(), 3);
736736
assert!(values
737737
.iter()
738-
.any(|(key, value)| *key == 1 && *value == "value1"));
738+
.any(|(_, key, value)| *key == 1 && *value == "value1"));
739739
assert!(values
740740
.iter()
741-
.any(|(key, value)| *key == 2 && *value == "value2"));
741+
.any(|(_, key, value)| *key == 2 && *value == "value2"));
742742
assert!(values
743743
.iter()
744-
.any(|(key, value)| *key == 3 && *value == "value3"));
744+
.any(|(_, key, value)| *key == 3 && *value == "value3"));
745745
}
746746

747747
#[test]
@@ -760,10 +760,10 @@ mod tests {
760760
assert_eq!(values.len(), 2);
761761
assert!(values
762762
.iter()
763-
.any(|(key, value)| *key == 1 && *value == "value1"));
763+
.any(|(_, key, value)| *key == 1 && *value == "value1"));
764764
assert!(values
765765
.iter()
766-
.any(|(key, value)| *key == 2 && *value == "value2"));
766+
.any(|(_, key, value)| *key == 2 && *value == "value2"));
767767
}
768768

769769
#[test]
@@ -797,13 +797,13 @@ mod tests {
797797
assert_eq!(values.len(), 3);
798798
assert!(values
799799
.iter()
800-
.any(|(key, value)| *key == 1.0 && *value == "value1"));
800+
.any(|(_, key, value)| *key == 1.0 && *value == "value1"));
801801
assert!(values
802802
.iter()
803-
.any(|(key, value)| *key == 2.0 && *value == "value2"));
803+
.any(|(_, key, value)| *key == 2.0 && *value == "value2"));
804804
assert!(values
805805
.iter()
806-
.any(|(key, value)| *key == 3.0 && *value == "value3"));
806+
.any(|(_, key, value)| *key == 3.0 && *value == "value3"));
807807
}
808808

809809
#[test]
@@ -822,10 +822,10 @@ mod tests {
822822
assert_eq!(values.len(), 2);
823823
assert!(values
824824
.iter()
825-
.any(|(key, value)| *key == 1.0 && *value == "value1"));
825+
.any(|(_, key, value)| *key == 1.0 && *value == "value1"));
826826
assert!(values
827827
.iter()
828-
.any(|(key, value)| *key == 2.0 && *value == "value2"));
828+
.any(|(_, key, value)| *key == 2.0 && *value == "value2"));
829829
}
830830

831831
#[test]

rust/blockstore/src/types/reader.rs

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -67,7 +67,9 @@ impl<
6767
&'referred_data self,
6868
prefix_range: PrefixRange,
6969
key_range: KeyRange,
70-
) -> impl Stream<Item = Result<(K, V), Box<dyn ChromaError>>> + 'referred_data + Send
70+
) -> impl Stream<Item = Result<(&'referred_data str, K, V), Box<dyn ChromaError>>>
71+
+ 'referred_data
72+
+ Send
7173
where
7274
PrefixRange: RangeBounds<&'prefix str> + Clone + Send + 'referred_data,
7375
KeyRange: RangeBounds<K> + Clone + Send + 'referred_data,
@@ -92,7 +94,7 @@ impl<
9294
&'referred_data self,
9395
prefix_range: PrefixRange,
9496
key_range: KeyRange,
95-
) -> Result<Vec<(K, V)>, Box<dyn ChromaError>>
97+
) -> Result<Vec<(&'referred_data str, K, V)>, Box<dyn ChromaError>>
9698
where
9799
PrefixRange: RangeBounds<&'prefix str> + Clone,
98100
KeyRange: RangeBounds<K> + Clone,

rust/blockstore/tests/blockfile_writer_test.rs

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -284,8 +284,8 @@ mod tests {
284284
for (blockfile_entry, expected_entry) in
285285
all_entries.iter().zip(ref_last_commit.iter())
286286
{
287-
assert_eq!(blockfile_entry.0, expected_entry.0 .1); // key matches
288-
assert_eq!(blockfile_entry.1, expected_entry.1); // value matches
287+
assert_eq!(blockfile_entry.1, expected_entry.0 .1); // key matches
288+
assert_eq!(blockfile_entry.2, expected_entry.1); // value matches
289289
}
290290
}
291291

0 commit comments

Comments
 (0)