Skip to content

[CLN] Rename $matches to $regex #4506

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
May 9, 2025

Conversation

Sicheng-Pan
Copy link
Contributor

@Sicheng-Pan Sicheng-Pan commented May 8, 2025

Description of changes

Summarize the changes made by this PR.

  • Improvements & Bug fixes
    • Renames $matches and $not_matches to $regex and $not_regex
  • New functionality
    • N/A

Test plan

How are these changes tested?

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?

Copy link

github-actions bot commented May 8, 2025

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

Copy link
Contributor Author

Sicheng-Pan commented May 8, 2025

@Sicheng-Pan Sicheng-Pan marked this pull request as ready for review May 8, 2025 22:01
Copy link
Contributor

propel-code-bot bot commented May 8, 2025

Rename $matches to $regex for Improved Clarity and Optimized Performance

This PR renames document operators from $matches/$not_matches to $regex/$not_regex for better clarity and semantic meaning. It also improves regex matching performance through optimized implementation using HashSet instead of RoaringBitmap, and adds proper validation for regex patterns using a literal expression check.

Key Changes:
• Renamed $matches and $not_matches operators to $regex and $not_regex throughout the codebase
• Refactored regex matching implementation to use HashSet instead of RoaringBitmap for improved performance
• Moved contains_ngram_literal method from ChromaHir to LiteralExpr for better validation
• Fixed validation logic in where_parsing.rs to use the new LiteralExpr method
• Updated all related protobuf definitions and type declarations

Affected Areas:
• Document regex filtering operators in API
• Regex pattern validation logic
• Regex matching implementation in rust/types/src/regex/
• Protocol buffer definitions for document operators
• Python API type definitions and validation

Potential Impact:

Functionality: Functionality remains the same but with improved naming for better semantic understanding. API consumers will need to update from $matches to $regex in their queries.

Performance: Significant performance improvement in regex matching by using HashSet instead of RoaringBitmap, which reduces overhead in handling matching positions.

Security: Improved validation for regex patterns helps prevent overly permissive patterns that could cause performance issues.

Scalability: Better performance in regex matching improves scalability for document search operations, especially for large collections.

Review Focus:
• Renamed operator naming consistency across the codebase
• Implementation of HashSet-based regex matching instead of RoaringBitmap
• Validation logic for regex patterns using LiteralExpr
• Potential performance implications of the implementation changes

Testing Needed

• Verify regex matching still works correctly with the renamed operators
• Benchmark performance of new regex matching implementation
• Test edge cases in regex pattern validation
• Ensure backward compatibility handling for existing users of $matches

Code Quality Assessment

rust/types/src/regex/literal_expr.rs: Added a useful method to check for ngram literals with parametrized max width

rust/worker/src/execution/operators/filter.rs: Improved filtering threshold logic to be proportional to collection size

rust/index/src/fulltext/types.rs: Simplified implementation by avoiding unnecessary copying of position data

Best Practices

Naming:
• Improved semantic clarity by using $regex instead of $matches
• Consistent naming across Python and Rust implementation

Performance:
• Replaced RoaringBitmap with HashSet for more efficient regex matching
• Added dynamic threshold based on collection size for optimized filtering

Code Organization:
• Moved contains_ngram_literal method to LiteralExpr for better encapsulation

Potential Issues

• Breaking change for existing clients using $matches/$not_matches operators
• Need to update documentation to reflect the new operator names
• May need further testing to ensure performance improvements work across different data patterns

This summary was automatically generated by @propel-code-bot

@Sicheng-Pan Sicheng-Pan force-pushed the sicheng/05-08-_cln_rename_matches_to_regex branch from b5ba11c to f77fb3d Compare May 9, 2025 00:03
@Sicheng-Pan Sicheng-Pan force-pushed the sicheng/05-06-_bug_fix_blockfile_range_scan branch from f2751c5 to c89f762 Compare May 9, 2025 00:03
@Sicheng-Pan Sicheng-Pan force-pushed the sicheng/05-08-_cln_rename_matches_to_regex branch from f77fb3d to e157e69 Compare May 9, 2025 00:33
@Sicheng-Pan Sicheng-Pan force-pushed the sicheng/05-06-_bug_fix_blockfile_range_scan branch from c89f762 to 6902fa0 Compare May 9, 2025 00:33
Copy link
Contributor Author

Sicheng-Pan commented May 9, 2025

Merge activity

  • May 9, 12:59 PM EDT: A user started a stack merge that includes this pull request via Graphite.
  • May 9, 1:04 PM EDT: The Graphite merge of this pull request was cancelled.

@Sicheng-Pan Sicheng-Pan changed the base branch from sicheng/05-06-_bug_fix_blockfile_range_scan to graphite-base/4506 May 9, 2025 17:07
@Sicheng-Pan Sicheng-Pan force-pushed the sicheng/05-08-_cln_rename_matches_to_regex branch from e157e69 to 75486e3 Compare May 9, 2025 19:12
@Sicheng-Pan Sicheng-Pan force-pushed the graphite-base/4506 branch from 6902fa0 to e13f8df Compare May 9, 2025 19:12
@Sicheng-Pan Sicheng-Pan changed the base branch from graphite-base/4506 to main May 9, 2025 19:12
@Sicheng-Pan Sicheng-Pan force-pushed the sicheng/05-08-_cln_rename_matches_to_regex branch from 75486e3 to a32ca6b Compare May 9, 2025 20:03
@Sicheng-Pan Sicheng-Pan merged commit ed884fa into main May 9, 2025
4 checks passed
itaismith pushed a commit that referenced this pull request May 23, 2025
## Description of changes

_Summarize the changes made by this PR._

- Improvements & Bug fixes
  - Renames `$matches` and `$not_matches` to `$regex` and `$not_regex`
- New functionality
  - N/A

## Test plan

_How are these changes tested?_

- [ ] Tests pass locally with `pytest` for python, `yarn test` for js, `cargo test` for rust

## Documentation Changes

_Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the [docs section](https://github.com/chroma-core/chroma/tree/main/docs/docs.trychroma.com)?_
@Sicheng-Pan Sicheng-Pan deleted the sicheng/05-08-_cln_rename_matches_to_regex branch May 28, 2025 23:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants