-
Notifications
You must be signed in to change notification settings - Fork 1.6k
[BUG] Fix race in get_or_create #4568
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
if err != nil { | ||
log.Error("[HAMMAD]") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clean
Fix Race Condition in get_or_create for Collection Creation This PR addresses a race condition in the get_or_create logic for collection creation, where concurrent create and delete operations could result in returning deleted collections, inconsistent metadata, or improper unique constraint violations. The fix introduces a more robust approach utilizing a new InsertOnConflictDoNothing method at the database layer to ensure atomicity, better error handling (including a new concurrent deletion error), and improved error propagation across Go, Python, and Rust interfaces. Extensive documentation in the code outlines edge condition handling, and a multi-threaded stress test is added to validate behavior under concurrency. Key Changes: Affected Areas: Potential Impact: Functionality: Improves correctness for concurrent collection creation and deletion; prevents inconsistent metadata, deleted-collection returns, and unwarranted constraint errors. Performance: Introduces slight additional logic in get_or_create, but preserves happy path efficiency; potential for further optimization if InsertOnConflictDoNothing always used. Security: No direct impact. Scalability: Improves robustness for high-concurrency environments and bulk user scenarios. Review Focus: Testing Needed• Run full test suite, especially chromadb/test/api/ Code Quality Assessmentgo/pkg/sysdb/metastore/db/dbmodel/mocks/ICollectionDb.go: Regenerated to match new DAO signature. go/pkg/sysdb/coordinator/table_catalog.go: Good use of comments for possible races, clear error handling, but complexity increased—must verify readability. go/pkg/sysdb/metastore/db/dao/collection.go: New conflict-aware insert is correctly guarded and documented. chromadb/test/api/test_collection.py: Test is thorough and stresses concurrency, use of ChromaError is robust. rust/types/src/api_types.rs: Correct error typing and mapping. Best PracticesConcurrency Handling: Testing: Documentation: Potential Issues• Commented timing assumption regarding hard deletes could pose risk if deletion window changes. This summary was automatically generated by @propel-code-bot |
if createCollection.GetOrCreate { | ||
created, err = tc.metaDomain.CollectionDb(txCtx).InsertOnConflictDoNothing(dbCollection) | ||
} else { | ||
err = tc.metaDomain.CollectionDb(txCtx).Insert(dbCollection) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this lead to inconsistency? For e.g.
- Thread 1 calls create_collection and inserts a row
- Thread 2 calls delete and deletes the row
- Thread 3 calls create_collection and creates a row
- Thread 1 now proceeds to insert metadata and inserts the metadata
Thus collection metadata is of thread 1 whereas collection info is of thread 3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline that we are ok with timing assumption that hard delete can't happen in (2)
## Description of changes _Summarize the changes made by this PR._ - Improvements & Bug fixes - Fixes a race in get_or_create where a deleted collection may be returned, or where get_or_create may be inconsistent between collection and metadata, or where you may return a unique constraint violation despite get_or_create semantics. The details of the fix are well commented in the code and I delegate detail to those comments. - I added `ChromaError` to the chroma types, for generic error handling fallbacks - This error is propagated all the way through and I validated all cases will be properly reported when trigerred. - New functionality - None ## Test plan _How are these changes tested?_ I added a test that stresses the race, it fails on main. - [x] Tests pass locally with `pytest` for python, `yarn test` for js, `cargo test` for rust ## Documentation Changes None
This PR cherry-picks the commit 730ed06 onto rc/2025-05-16. If there are unresolved conflicts, please resolve them manually. Co-authored-by: Hammad Bashir <[email protected]>
Description of changes
Summarize the changes made by this PR.
ChromaError
to the chroma types, for generic error handling fallbacksTest plan
How are these changes tested?
I added a test that stresses the race, it fails on main.
pytest
for python,yarn test
for js,cargo test
for rustDocumentation Changes
None