-
Notifications
You must be signed in to change notification settings - Fork 1.7k
[ENH] Wireup regex in filter operator #4452
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] Wireup regex in filter operator #4452
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking. |
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
3e22395
to
f844158
Compare
e0562e1
to
a0b12e4
Compare
Implementing Regex Matching in Filter Operator This PR adds regex pattern matching capabilities to the filter operator, enabling document-level regex filtering. It modifies the blockfile reader API to return prefixes along with keys and values, and implements an optimized regex evaluation process that leverages full-text indexes for preliminary candidate selection before performing regex validation. This significantly improves performance for regex queries by avoiding unnecessary document scanning when possible. Key Changes: Affected Areas: Potential Impact: Functionality: Adds regex matching capability for document filtering, providing users with more powerful query options Performance: Optimizes regex matching by using full-text indexes for preliminary candidate selection before regex validation Security: No significant security implications Scalability: Efficient regex implementation should maintain good performance as document collections grow Review Focus: Testing Needed• Benchmark different regex patterns against large document collections Code Quality Assessmentrust/worker/src/execution/operators/filter.rs: Good implementation with clear error handling, but could benefit from parallelizing document fetches rust/index/src/fulltext/types.rs: Well-structured but contains unwrap() calls that could be replaced with expect() for better error messages rust/blockstore/src/memory/reader_writer.rs: Multiple unwrap() calls that should be replaced with more robust error handling rust/blockstore/src/arrow/block/types.rs: Contains unwraps on downcast operations that could lead to panics if the column type changes Best PracticesPerformance: Error Handling: Potential Issues• Sequential document fetching in filter operator could be a performance bottleneck with large result sets This summary was automatically generated by @propel-code-bot |
ed9313a
to
cf231ca
Compare
cf231ca
to
56cab48
Compare
8526d51
to
f27a28c
Compare
3b34869
to
49868c8
Compare
49868c8
to
5664732
Compare
4bd67f2
to
eb11217
Compare
b8c6c1c
to
1b2e06c
Compare
eb11217
to
e6012cf
Compare
1b2e06c
to
966acb9
Compare
Merge activity
|
e6012cf
to
0838c8f
Compare
## Description of changes _Summarize the changes made by this PR._ - Improvements & Bug fixes - Return prefix as part of `get_range` method - New functionality - Wire up regex evaluation process in filter operator ## Test plan _How are these changes tested?_ - [ ] Tests pass locally with `pytest` for python, `yarn test` for js, `cargo test` for rust ## Documentation Changes _Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the [docs section](https://github.com/chroma-core/chroma/tree/main/docs/docs.trychroma.com)?_ <!-- Summary by @propel-code-bot --> --- **Implementing Regex Matching in Filter Operator** This PR implements regex evaluation within the filter operator, enabling document-level regex matching. It changes blockfile reader APIs to expose prefixes as part of query results and leverages full-text indexes for more efficient regex evaluation. The implementation includes optimizations for exact pattern matching that avoids re-scanning documents when possible. **Key Changes:** • Add regex pattern evaluation in filter operator using `ChromaRegex` from `chroma_types` • Update blockfile reader `APIs` to return prefixes along with keys and values • Implement `NgramLiteralProvider` for `FullTextIndexReader` to optimize regex searches • Add benchmark for regex matching with various patterns • Return prefix as part of `get_range` and `get_range_stream` methods in blockstore **Affected Areas:** • Worker filter operator implementation • Blockstore reader/writer ``API`` • `FullText` index implementation • Distributed segment implementation • Metadata segment implementation **Potential Impact:** **Functionality**: Adds regex matching capability for document filtering, providing users with more powerful query options **Performance**: Optimizes regex matching by using full-text indexes for preliminary candidate selection before regex validation, improving search performance for complex patterns **Security**: No significant security implications **Scalability**: Efficient regex implementation should maintain good performance as document collections grow **Review Focus:** • Regex implementation efficiency in filter.rs • Error handling for regex pattern compilation and matching • ``API`` changes to return prefix in blockstore methods • Benchmark methodology for regex performance evaluation <details> <summary><strong>Testing Needed</strong></summary> • Benchmark different regex patterns against large document collections • Test complex regex patterns with edge cases • Verify consistent results between optimized and brute-force implementations • Test with documents of varying sizes and content </details> <details> <summary><strong>Code Quality Assessment</strong></summary> **rust/worker/src/execution/operators/filter.rs**: Well-structured implementation with good error handling, but has a potential performance bottleneck with sequential document fetching **rust/blockstore/src/memory/reader_writer.rs**: Multiple unwrap() calls could be replaced with expect() for better error messages **rust/index/src/fulltext/types.rs**: Possible lifetime parameter issues in the lookup_ngram_range method </details> <details> <summary><strong>Best Practices</strong></summary> **Performance**: • Consider parallelizing document fetches with `buffer_unordered` for concurrent I/O **Error Handling**: • Replace unwrap() with expect() for better error messages • Consider proper error propagation instead of unwrapping </details> <details> <summary><strong>Possible Issues</strong></summary> • Potential panics from unwrap() calls in several places • Sequential document fetching in filter operator might cause performance issues with large result sets • Lifetime parameter conflicts in lookup_ngram_range method • Fallback bitmap handling in assertions could cause test failures </details> --- *This summary was automatically generated by @propel-code-bot*
Description of changes
Summarize the changes made by this PR.
get_range
methodTest plan
How are these changes tested?
pytest
for python,yarn test
for js,cargo test
for rustDocumentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?
Implementing Regex Matching in Filter Operator
This PR implements regex evaluation within the filter operator, enabling document-level regex matching. It changes blockfile reader APIs to expose prefixes as part of query results and leverages full-text indexes for more efficient regex evaluation. The implementation includes optimizations for exact pattern matching that avoids re-scanning documents when possible.
Key Changes:
• Add regex pattern evaluation in filter operator using
ChromaRegex
fromchroma_types
• Update blockfile reader
APIs
to return prefixes along with keys and values• Implement
NgramLiteralProvider
forFullTextIndexReader
to optimize regex searches• Add benchmark for regex matching with various patterns
• Return prefix as part of
get_range
andget_range_stream
methods in blockstoreAffected Areas:
• Worker filter operator implementation
• Blockstore reader/writer
API
•
FullText
index implementation• Distributed segment implementation
• Metadata segment implementation
Potential Impact:
Functionality: Adds regex matching capability for document filtering, providing users with more powerful query options
Performance: Optimizes regex matching by using full-text indexes for preliminary candidate selection before regex validation, improving search performance for complex patterns
Security: No significant security implications
Scalability: Efficient regex implementation should maintain good performance as document collections grow
Review Focus:
• Regex implementation efficiency in filter.rs
• Error handling for regex pattern compilation and matching
•
API
changes to return prefix in blockstore methods• Benchmark methodology for regex performance evaluation
Testing Needed
• Benchmark different regex patterns against large document collections
• Test complex regex patterns with edge cases
• Verify consistent results between optimized and brute-force implementations
• Test with documents of varying sizes and content
Code Quality Assessment
rust/worker/src/execution/operators/filter.rs: Well-structured implementation with good error handling, but has a potential performance bottleneck with sequential document fetching
rust/blockstore/src/memory/reader_writer.rs: Multiple unwrap() calls could be replaced with expect() for better error messages
rust/index/src/fulltext/types.rs: Possible lifetime parameter issues in the lookup_ngram_range method
Best Practices
Performance:
• Consider parallelizing document fetches with
buffer_unordered
for concurrent I/OError Handling:
• Replace unwrap() with expect() for better error messages
• Consider proper error propagation instead of unwrapping
Possible Issues
• Potential panics from unwrap() calls in several places
• Sequential document fetching in filter operator might cause performance issues with large result sets
• Lifetime parameter conflicts in lookup_ngram_range method
• Fallback bitmap handling in assertions could cause test failures
This summary was automatically generated by @propel-code-bot