Skip to content

[ENH] Wireup regex in filter operator #4452

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
May 9, 2025

Conversation

Sicheng-Pan
Copy link
Contributor

@Sicheng-Pan Sicheng-Pan commented May 5, 2025

Description of changes

Summarize the changes made by this PR.

  • Improvements & Bug fixes
    • Return prefix as part of get_range method
  • New functionality
    • Wire up regex evaluation process in filter operator

Test plan

How are these changes tested?

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?


Implementing Regex Matching in Filter Operator

This PR implements regex evaluation within the filter operator, enabling document-level regex matching. It changes blockfile reader APIs to expose prefixes as part of query results and leverages full-text indexes for more efficient regex evaluation. The implementation includes optimizations for exact pattern matching that avoids re-scanning documents when possible.

Key Changes:
• Add regex pattern evaluation in filter operator using ChromaRegex from chroma_types
• Update blockfile reader APIs to return prefixes along with keys and values
• Implement NgramLiteralProvider for FullTextIndexReader to optimize regex searches
• Add benchmark for regex matching with various patterns
• Return prefix as part of get_range and get_range_stream methods in blockstore

Affected Areas:
• Worker filter operator implementation
• Blockstore reader/writer API
FullText index implementation
• Distributed segment implementation
• Metadata segment implementation

Potential Impact:

Functionality: Adds regex matching capability for document filtering, providing users with more powerful query options

Performance: Optimizes regex matching by using full-text indexes for preliminary candidate selection before regex validation, improving search performance for complex patterns

Security: No significant security implications

Scalability: Efficient regex implementation should maintain good performance as document collections grow

Review Focus:
• Regex implementation efficiency in filter.rs
• Error handling for regex pattern compilation and matching
API changes to return prefix in blockstore methods
• Benchmark methodology for regex performance evaluation

Testing Needed

• Benchmark different regex patterns against large document collections
• Test complex regex patterns with edge cases
• Verify consistent results between optimized and brute-force implementations
• Test with documents of varying sizes and content

Code Quality Assessment

rust/worker/src/execution/operators/filter.rs: Well-structured implementation with good error handling, but has a potential performance bottleneck with sequential document fetching

rust/blockstore/src/memory/reader_writer.rs: Multiple unwrap() calls could be replaced with expect() for better error messages

rust/index/src/fulltext/types.rs: Possible lifetime parameter issues in the lookup_ngram_range method

Best Practices

Performance:
• Consider parallelizing document fetches with buffer_unordered for concurrent I/O

Error Handling:
• Replace unwrap() with expect() for better error messages
• Consider proper error propagation instead of unwrapping

Possible Issues

• Potential panics from unwrap() calls in several places
• Sequential document fetching in filter operator might cause performance issues with large result sets
• Lifetime parameter conflicts in lookup_ngram_range method
• Fallback bitmap handling in assertions could cause test failures


This summary was automatically generated by @propel-code-bot

Copy link
Contributor Author

Sicheng-Pan commented May 5, 2025

Copy link

github-actions bot commented May 5, 2025

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

@Sicheng-Pan Sicheng-Pan force-pushed the sicheng/05-02-_enh_implement_literal_provider branch from 3e22395 to f844158 Compare May 6, 2025 18:25
@Sicheng-Pan Sicheng-Pan force-pushed the sicheng/05-05-_enh_wireup_regex_in_filter_operator branch 3 times, most recently from e0562e1 to a0b12e4 Compare May 6, 2025 22:48
@Sicheng-Pan Sicheng-Pan mentioned this pull request May 6, 2025
1 task
@Sicheng-Pan Sicheng-Pan marked this pull request as ready for review May 7, 2025 00:20
Copy link
Contributor

propel-code-bot bot commented May 7, 2025

Implementing Regex Matching in Filter Operator

This PR adds regex pattern matching capabilities to the filter operator, enabling document-level regex filtering. It modifies the blockfile reader API to return prefixes along with keys and values, and implements an optimized regex evaluation process that leverages full-text indexes for preliminary candidate selection before performing regex validation. This significantly improves performance for regex queries by avoiding unnecessary document scanning when possible.

Key Changes:
• Add regex pattern evaluation in filter operator using ChromaRegex from chroma_types
• Update blockfile reader API to return prefixes as part of query results
• Implement NgramLiteralProvider for FullTextIndexReader to optimize regex searches
• Add benchmark for regex matching with various patterns
• Optimize exact pattern matching to avoid re-scanning when possible

Affected Areas:
• Worker filter operator implementation
• Blockstore reader/writer API
• FullText index implementation
• Distributed segment implementation
• Metadata segment implementation

Potential Impact:

Functionality: Adds regex matching capability for document filtering, providing users with more powerful query options

Performance: Optimizes regex matching by using full-text indexes for preliminary candidate selection before regex validation

Security: No significant security implications

Scalability: Efficient regex implementation should maintain good performance as document collections grow

Review Focus:
• Regex implementation efficiency in filter.rs
• Error handling for regex pattern compilation and matching
• API changes to return prefix in blockstore methods
• Logic for optimizing regex searches using full-text index
• Performance characteristics of the implementation

Testing Needed

• Benchmark different regex patterns against large document collections
• Test complex regex patterns with edge cases
• Verify consistent results between optimized and brute-force implementations
• Test with documents of varying sizes and content

Code Quality Assessment

rust/worker/src/execution/operators/filter.rs: Good implementation with clear error handling, but could benefit from parallelizing document fetches

rust/index/src/fulltext/types.rs: Well-structured but contains unwrap() calls that could be replaced with expect() for better error messages

rust/blockstore/src/memory/reader_writer.rs: Multiple unwrap() calls that should be replaced with more robust error handling

rust/blockstore/src/arrow/block/types.rs: Contains unwraps on downcast operations that could lead to panics if the column type changes

Best Practices

Performance:
• Consider parallelizing document fetches with buffer_unordered for concurrent I/O

Error Handling:
• Replace unwrap() with expect() for better error messages
• Consider proper error propagation instead of unwrapping

Potential Issues

• Sequential document fetching in filter operator could be a performance bottleneck with large result sets
• Potential lifetimes conflict in lookup_ngram_range method signature
• Logic issue when FTI provides an exact match but no candidates - currently returns all documents
• Multiple unwrap() calls throughout the codebase could lead to runtime panics

This summary was automatically generated by @propel-code-bot

@Sicheng-Pan Sicheng-Pan force-pushed the sicheng/05-05-_enh_wireup_regex_in_filter_operator branch from ed9313a to cf231ca Compare May 7, 2025 22:44
@Sicheng-Pan Sicheng-Pan force-pushed the sicheng/05-05-_enh_wireup_regex_in_filter_operator branch from cf231ca to 56cab48 Compare May 7, 2025 22:54
@Sicheng-Pan Sicheng-Pan force-pushed the sicheng/05-05-_enh_wireup_regex_in_filter_operator branch 2 times, most recently from 8526d51 to f27a28c Compare May 8, 2025 01:50
@Sicheng-Pan Sicheng-Pan force-pushed the sicheng/05-05-_enh_wireup_regex_in_filter_operator branch 2 times, most recently from 3b34869 to 49868c8 Compare May 8, 2025 18:25
@Sicheng-Pan Sicheng-Pan force-pushed the sicheng/05-05-_enh_wireup_regex_in_filter_operator branch from 49868c8 to 5664732 Compare May 8, 2025 21:45
@Sicheng-Pan Sicheng-Pan mentioned this pull request May 8, 2025
1 task
@Sicheng-Pan Sicheng-Pan force-pushed the sicheng/05-05-_enh_wireup_regex_in_filter_operator branch from 4bd67f2 to eb11217 Compare May 9, 2025 00:03
@Sicheng-Pan Sicheng-Pan force-pushed the sicheng/05-02-_enh_implement_literal_provider branch from b8c6c1c to 1b2e06c Compare May 9, 2025 00:03
@Sicheng-Pan Sicheng-Pan force-pushed the sicheng/05-05-_enh_wireup_regex_in_filter_operator branch from eb11217 to e6012cf Compare May 9, 2025 00:33
@Sicheng-Pan Sicheng-Pan force-pushed the sicheng/05-02-_enh_implement_literal_provider branch from 1b2e06c to 966acb9 Compare May 9, 2025 00:33
Copy link
Contributor Author

Sicheng-Pan commented May 9, 2025

Merge activity

  • May 9, 12:59 PM EDT: A user started a stack merge that includes this pull request via Graphite.
  • May 9, 1:04 PM EDT: The Graphite merge of this pull request was cancelled.
  • May 9, 1:04 PM EDT: A user started a stack merge that includes this pull request via Graphite.
  • May 9, 1:06 PM EDT: Graphite rebased this pull request as part of a merge.
  • May 9, 1:07 PM EDT: @Sicheng-Pan merged this pull request with Graphite.

@Sicheng-Pan Sicheng-Pan changed the base branch from sicheng/05-02-_enh_implement_literal_provider to graphite-base/4452 May 9, 2025 17:00
@Sicheng-Pan Sicheng-Pan changed the base branch from graphite-base/4452 to main May 9, 2025 17:05
@Sicheng-Pan Sicheng-Pan force-pushed the sicheng/05-05-_enh_wireup_regex_in_filter_operator branch from e6012cf to 0838c8f Compare May 9, 2025 17:05
@Sicheng-Pan Sicheng-Pan merged commit 225f76f into main May 9, 2025
4 checks passed
itaismith pushed a commit that referenced this pull request May 23, 2025
## Description of changes

_Summarize the changes made by this PR._

- Improvements & Bug fixes
  - Return prefix as part of `get_range` method
- New functionality
  - Wire up regex evaluation process in filter operator

## Test plan

_How are these changes tested?_

- [ ] Tests pass locally with `pytest` for python, `yarn test` for js, `cargo test` for rust

## Documentation Changes

_Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the [docs section](https://github.com/chroma-core/chroma/tree/main/docs/docs.trychroma.com)?_
<!-- Summary by @propel-code-bot -->

---

**Implementing Regex Matching in Filter Operator**

This PR implements regex evaluation within the filter operator, enabling document-level regex matching. It changes blockfile reader APIs to expose prefixes as part of query results and leverages full-text indexes for more efficient regex evaluation. The implementation includes optimizations for exact pattern matching that avoids re-scanning documents when possible.

**Key Changes:**
• Add regex pattern evaluation in filter operator using `ChromaRegex` from `chroma_types`
• Update blockfile reader `APIs` to return prefixes along with keys and values
• Implement `NgramLiteralProvider` for `FullTextIndexReader` to optimize regex searches
• Add benchmark for regex matching with various patterns
• Return prefix as part of `get_range` and `get_range_stream` methods in blockstore


**Affected Areas:**
• Worker filter operator implementation
• Blockstore reader/writer ``API``
• `FullText` index implementation
• Distributed segment implementation
• Metadata segment implementation


**Potential Impact:**

**Functionality**: Adds regex matching capability for document filtering, providing users with more powerful query options

**Performance**: Optimizes regex matching by using full-text indexes for preliminary candidate selection before regex validation, improving search performance for complex patterns

**Security**: No significant security implications

**Scalability**: Efficient regex implementation should maintain good performance as document collections grow

**Review Focus:**
• Regex implementation efficiency in filter.rs
• Error handling for regex pattern compilation and matching
• ``API`` changes to return prefix in blockstore methods
• Benchmark methodology for regex performance evaluation


<details>
<summary><strong>Testing Needed</strong></summary>

• Benchmark different regex patterns against large document collections
• Test complex regex patterns with edge cases
• Verify consistent results between optimized and brute-force implementations
• Test with documents of varying sizes and content

</details>

<details>
<summary><strong>Code Quality Assessment</strong></summary>

**rust/worker/src/execution/operators/filter.rs**: Well-structured implementation with good error handling, but has a potential performance bottleneck with sequential document fetching

**rust/blockstore/src/memory/reader_writer.rs**: Multiple unwrap() calls could be replaced with expect() for better error messages

**rust/index/src/fulltext/types.rs**: Possible lifetime parameter issues in the lookup_ngram_range method


</details>

<details>
<summary><strong>Best Practices</strong></summary>

**Performance**:
• Consider parallelizing document fetches with `buffer_unordered` for concurrent I/O

**Error Handling**:
• Replace unwrap() with expect() for better error messages
• Consider proper error propagation instead of unwrapping


</details>

<details>
<summary><strong>Possible Issues</strong></summary>

• Potential panics from unwrap() calls in several places
• Sequential document fetching in filter operator might cause performance issues with large result sets
• Lifetime parameter conflicts in lookup_ngram_range method
• Fallback bitmap handling in assertions could cause test failures

</details>

---
*This summary was automatically generated by @propel-code-bot*
@Sicheng-Pan Sicheng-Pan deleted the sicheng/05-05-_enh_wireup_regex_in_filter_operator branch May 28, 2025 23:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants