Skip to content

[BUG] Fix race in get_or_create #4568

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
May 20, 2025
Merged

[BUG] Fix race in get_or_create #4568

merged 7 commits into from
May 20, 2025

Conversation

HammadB
Copy link
Collaborator

@HammadB HammadB commented May 18, 2025

Description of changes

Summarize the changes made by this PR.

  • Improvements & Bug fixes
    • Fixes a race in get_or_create where a deleted collection may be returned, or where get_or_create may be inconsistent between collection and metadata, or where you may return a unique constraint violation despite get_or_create semantics. The details of the fix are well commented in the code and I delegate detail to those comments.
    • I added ChromaError to the chroma types, for generic error handling fallbacks
    • This error is propagated all the way through and I validated all cases will be properly reported when trigerred.
  • New functionality
    • None

Test plan

How are these changes tested?
I added a test that stresses the race, it fails on main.

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Documentation Changes

None

Copy link

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

if err != nil {
log.Error("[HAMMAD]")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clean

@HammadB HammadB marked this pull request as ready for review May 18, 2025 21:41
Copy link
Contributor

propel-code-bot bot commented May 18, 2025

Fix Race Condition in get_or_create for Collection Creation

This PR addresses a race condition in the get_or_create logic for collection creation, where concurrent create and delete operations could result in returning deleted collections, inconsistent metadata, or improper unique constraint violations. The fix introduces a more robust approach utilizing a new InsertOnConflictDoNothing method at the database layer to ensure atomicity, better error handling (including a new concurrent deletion error), and improved error propagation across Go, Python, and Rust interfaces. Extensive documentation in the code outlines edge condition handling, and a multi-threaded stress test is added to validate behavior under concurrency.

Key Changes:
• Refactored get_or_create logic in Go (table_catalog.go) to use InsertOnConflictDoNothing for safe concurrent handling.
• Added new InsertOnConflictDoNothing method for ICollectionDb and implemented in collection DAO and mocks.
• Introduced ErrConcurrentDeleteCollection error and corresponding GRPC error mapping for aborted operations.
• Propagated new ChromaError type for generic error handling in Python and extended error mapping in Rust.
• Added robust multi-threaded API test (test_multithreaded_get_or_create) that fails on main and passes here.
• Updated various error handling paths and documentation comments for clarity on transactional isolation and races.

Affected Areas:
• Go coordinator and metastore DB: table_catalog.go, dbmodel/collection.go, dao/collection.go, errors.go, grpc/collection_service.go, grpcutils/response.go, dbmodel/mocks/ICollectionDb.go
• Python API: errors.py, test/api/test_collection.py
• Rust types and sysdb: api_types.rs, sysdb.rs
• Test framework: stress test in Python

Potential Impact:

Functionality: Improves correctness for concurrent collection creation and deletion; prevents inconsistent metadata, deleted-collection returns, and unwarranted constraint errors.

Performance: Introduces slight additional logic in get_or_create, but preserves happy path efficiency; potential for further optimization if InsertOnConflictDoNothing always used.

Security: No direct impact.

Scalability: Improves robustness for high-concurrency environments and bulk user scenarios.

Review Focus:
• Correctness and atomicity in new get_or_create implementation.
• Edge-case handling when concurrent delete occurs between insert and get steps.
• Database transaction boundaries and error propagation.
• Error code mapping and user-facing error clarity.
• Coverage and adequacy of new multi-threaded test.

Testing Needed

• Run full test suite, especially chromadb/test/api/test_collection.py and new multithreaded test.
• Test concurrent get_or_create and deletes for various collection names.
• Check behavior under both Python and Rust clients for error propagation.
• Manually verify database state consistency if possible.

Code Quality Assessment

go/pkg/sysdb/metastore/db/dbmodel/mocks/ICollectionDb.go: Regenerated to match new DAO signature.

go/pkg/sysdb/coordinator/table_catalog.go: Good use of comments for possible races, clear error handling, but complexity increased—must verify readability.

go/pkg/sysdb/metastore/db/dao/collection.go: New conflict-aware insert is correctly guarded and documented.

chromadb/test/api/test_collection.py: Test is thorough and stresses concurrency, use of ChromaError is robust.

rust/types/src/api_types.rs: Correct error typing and mapping.

Best Practices

Concurrency Handling:
• transactional safety
• idempotent operations
• error code clarity

Testing:
• stress/multithreaded testing
• cross-language integration tests

Documentation:
• inline comments explaining reasoning and limitations

Potential Issues

• Commented timing assumption regarding hard deletes could pose risk if deletion window changes.
• InsertOnConflictDoNothing relies on unique index—any DB schema change could break guarantees.
• Log/GRPC error handling for concurrent deletes must be reviewed for user clarity.
• Some code duplication remains between get_or_create and create_collection paths.

This summary was automatically generated by @propel-code-bot

if createCollection.GetOrCreate {
created, err = tc.metaDomain.CollectionDb(txCtx).InsertOnConflictDoNothing(dbCollection)
} else {
err = tc.metaDomain.CollectionDb(txCtx).Insert(dbCollection)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this lead to inconsistency? For e.g.

  1. Thread 1 calls create_collection and inserts a row
  2. Thread 2 calls delete and deletes the row
  3. Thread 3 calls create_collection and creates a row
  4. Thread 1 now proceeds to insert metadata and inserts the metadata

Thus collection metadata is of thread 1 whereas collection info is of thread 3

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline that we are ok with timing assumption that hard delete can't happen in (2)

@HammadB HammadB enabled auto-merge (squash) May 19, 2025 23:03
@HammadB HammadB disabled auto-merge May 19, 2025 23:13
@HammadB HammadB enabled auto-merge (squash) May 20, 2025 00:16
@HammadB HammadB merged commit 730ed06 into main May 20, 2025
70 checks passed
chroma-droid pushed a commit that referenced this pull request May 20, 2025
## Description of changes

_Summarize the changes made by this PR._

- Improvements & Bug fixes
- Fixes a race in get_or_create where a deleted collection may be
returned, or where get_or_create may be inconsistent between collection
and metadata, or where you may return a unique constraint violation
despite get_or_create semantics. The details of the fix are well
commented in the code and I delegate detail to those comments.
- I added `ChromaError` to the chroma types, for generic error handling
fallbacks
- This error is propagated all the way through and I validated all cases
will be properly reported when trigerred.
- New functionality
  - None

## Test plan

_How are these changes tested?_
I added a test that stresses the race, it fails on main.
- [x] Tests pass locally with `pytest` for python, `yarn test` for js,
`cargo test` for rust

## Documentation Changes
None
HammadB added a commit that referenced this pull request May 20, 2025
This PR cherry-picks the commit 730ed06
onto rc/2025-05-16. If there are unresolved conflicts, please resolve
them manually.

Co-authored-by: Hammad Bashir <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants