-
Notifications
You must be signed in to change notification settings - Fork 1.6k
[ENH]: perform collection hard deletes from garbage collector #4605
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
5b21248
to
5b6a52b
Compare
f545524
to
822e31d
Compare
5b6a52b
to
65e13b1
Compare
822e31d
to
afe5084
Compare
65e13b1
to
9ab64f5
Compare
afe5084
to
9aed3e4
Compare
8aff535
to
d78d710
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changes in this file provide a little more debugging info + a new defensive check
9aed3e4
to
ca845a3
Compare
720af86
to
c815632
Compare
Enable Automatic Hard Deletion of Collections via Garbage Collector This PR implements major architectural changes to the garbage collection system in Chroma, allowing the orchestrator to perform hard deletion of collections that were previously soft deleted and have become eligible for permanent removal. The change propagates through several crates, introducing new status query APIs, collection dependency handling, data structure extensions, and test fixture modifications, with particular focus on guaranteeing correctness for forked/dependent collections and their deletion lifecycles. Key Changes: Affected Areas: Potential Impact: Functionality: Transitions deletion from manual/user-initiated steps to a fully orchestrated, topologically correct hard delete after all dependencies are met, affecting retention semantics, lifecycle guarantees, and overall data integrity for collections and their descendants. Performance: Minimal direct perf impact; some additional sysdb queries and graph traversals may add overhead for large fork trees, but these are generally asynchronous/batch operations. Security: Correct erasure of hard-deleted collections prevents data retention leaks; security posture enhanced if orchestrated properly. Scalability: Imposes a dependency graph and topological sort for deletes, but approach is inherently scalable as all edges are explicit. Soft delete batch queries scale with the number of collections eligible for GC. Review Focus: Testing Needed• Run full proptest suite to validate all new transition states and invariants, especially hard-deletion propagation through collection forks and dependencies. Code Quality Assessmentrust/sysdb/src/sysdb.rs: API extended cleanly; method naming is consistent. Implementation for 'finish_collection_deletion' and batch queries follows Rust async conventions. rust/garbage_collector/src/operators/compute_versions_to_delete_from_graph.rs: Logic extended to support soft-deleted collections; retains clear separation of concerns and remains testable. rust/garbage_collector/tests/proptest_helpers/garbage_collector_reference.rs: Substantial state and transition refactor, strong invariants, and testability improved. rust/garbage_collector/src/construct_version_graph_orchestrator.rs: Structural improvements with defensive checks; code cleanliness maintained. others: Internal type wiring and glue code follow Rust conventions. Test and ref code are verbose but structured. rust/garbage_collector/src/garbage_collector_orchestrator_v2.rs: Substantial new logic, clear structure, good use of error handling and tracing. Functions remain within idiomatic bounds. Watch for error propagation and partial state updates. Best PracticesTesting: Modularization: Error Handling: Documentation: Potential Issues• If a fork dependency graph is large and deeply nested, the toposort and eligibility checks could be computationally heavier than before. This summary was automatically generated by @propel-code-bot |
ca845a3
to
ccc9b82
Compare
c815632
to
d56d085
Compare
716f29a
to
9356e10
Compare
ed65c55
to
230d5ac
Compare
9356e10
to
8253c18
Compare
230d5ac
to
cf6858c
Compare
8253c18
to
e1a6311
Compare
cf6858c
to
4baa298
Compare
c85ee20
to
c0914e0
Compare
4baa298
to
114728f
Compare
c0914e0
to
b2befca
Compare
114728f
to
24d7092
Compare
b2befca
to
950e307
Compare
e949e65
to
0da7dfa
Compare
950e307
to
d1c4e49
Compare
0da7dfa
to
7de8859
Compare
7de8859
to
719d31b
Compare
d1c4e49
to
3d65425
Compare
cdc60db
to
e167434
Compare
e167434
to
3f81298
Compare
})?; | ||
|
||
for collection_id in topo.iter().rev() { | ||
let are_all_children_soft_deleted = petgraph::algo::dijkstra( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we iterate in reverse topological order and soft delete the collections, it should be guaranteed that all children are soft deleted in each step right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, because the graph is comprised of both alive and soft deleted collections
we could break out of the loop on the first alive collection we see (which I think is what you're saying?) but then it's possible to miss collections eligible for hard deletion
Description of changes
This updates the garbage collector to transition soft deleted collections to hard deleted when eligible.
Test plan
How are these changes tested?
pytest
for python,yarn test
for js,cargo test
for rustI updated the proptest with a new transition for deleting collections.
Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?
n/a