Skip to content

[ENH]: perform collection hard deletes from garbage collector #4605

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 29, 2025

Conversation

codetheweb
Copy link
Contributor

@codetheweb codetheweb commented May 22, 2025

Description of changes

This updates the garbage collector to transition soft deleted collections to hard deleted when eligible.

Test plan

How are these changes tested?

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

I updated the proptest with a new transition for deleting collections.

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?

n/a

Copy link

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection branch from 5b21248 to 5b6a52b Compare May 22, 2025 17:29
@codetheweb codetheweb changed the base branch from feat-validate-file-paths-during-gc to feat-gc-hard-delete-collection-sysdb May 22, 2025 17:29
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection-sysdb branch from f545524 to 822e31d Compare May 22, 2025 18:06
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection branch from 5b6a52b to 65e13b1 Compare May 22, 2025 18:06
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection-sysdb branch from 822e31d to afe5084 Compare May 22, 2025 18:13
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection branch from 65e13b1 to 9ab64f5 Compare May 22, 2025 18:13
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection-sysdb branch from afe5084 to 9aed3e4 Compare May 22, 2025 20:11
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection branch 2 times, most recently from 8aff535 to d78d710 Compare May 22, 2025 21:54
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes in this file provide a little more debugging info + a new defensive check

@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection-sysdb branch from 9aed3e4 to ca845a3 Compare May 22, 2025 22:23
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection branch 4 times, most recently from 720af86 to c815632 Compare May 22, 2025 23:11
@codetheweb codetheweb changed the title [ENH]: move collection hard deletes to garbage collector [ENH]: perform collection hard deletes from garbage collector May 22, 2025
@codetheweb codetheweb marked this pull request as ready for review May 22, 2025 23:13
@codetheweb codetheweb requested a review from sanketkedia May 22, 2025 23:13
Copy link
Contributor

propel-code-bot bot commented May 22, 2025

Enable Automatic Hard Deletion of Collections via Garbage Collector

This PR implements major architectural changes to the garbage collection system in Chroma, allowing the orchestrator to perform hard deletion of collections that were previously soft deleted and have become eligible for permanent removal. The change propagates through several crates, introducing new status query APIs, collection dependency handling, data structure extensions, and test fixture modifications, with particular focus on guaranteeing correctness for forked/dependent collections and their deletion lifecycles.

Key Changes:
• The garbage collector orchestrator (garbage_collector_orchestrator_v2.rs) now checks for soft-deleted collections and, when eligible (i.e., no dependent/forked live children), invokes hard deletion via sysdb.
• Added collection dependency graph construction in the garbage collector and use of topological order for deletion, ensuring children are deleted before parents.
• New sysdb APIs and gRPC methods (batch_get_collection_soft_delete_status, finish_collection_deletion) to support both querying and performing hard deletes; interfaces extended in sysdb.rs and wire/proto contracts updated.
• ComputeVersionsToDeleteOperator now receives the set of soft-deleted collections, marking all their versions for deletion.
• Proptest/test harnesses and the in-memory reference model were updated for the new collection lifecycle, with additional transitions and invariant checks (e.g., verifying hard deletes truly remove metadata from sysdb).
• Dependency improvements in collection graph construction, test helpers, and structure typing (e.g., separation of soft and hard delete states; explicit collection deletion sentinel values).

Affected Areas:
• garbage_collector_orchestrator_v2.rs
• sysdb.rs (and test_sysdb.rs)
• types.rs (types and API contracts)
• construct_version_graph_orchestrator.rs (defensive checks, graph construction)
• operators/compute_versions_to_delete_from_graph.rs
• proptest_helpers/garbage_collector_reference.rs/garbage_collector_under_test.rs

Potential Impact:

Functionality: Transitions deletion from manual/user-initiated steps to a fully orchestrated, topologically correct hard delete after all dependencies are met, affecting retention semantics, lifecycle guarantees, and overall data integrity for collections and their descendants.

Performance: Minimal direct perf impact; some additional sysdb queries and graph traversals may add overhead for large fork trees, but these are generally asynchronous/batch operations.

Security: Correct erasure of hard-deleted collections prevents data retention leaks; security posture enhanced if orchestrated properly.

Scalability: Imposes a dependency graph and topological sort for deletes, but approach is inherently scalable as all edges are explicit. Soft delete batch queries scale with the number of collections eligible for GC.

Review Focus:
• Correctness of dependency/resolution logic, particularly the topological sort and condition checks for hard delete eligibility.
• Backward compatibility of new sysdb API surfaces, including error handling and return data for new calls.
• Invariants/defensive checks in both garbage collection and graph construction.
• Test coverage and assertions for new collection lifecycle states.
• Impact on distributed/parallel GC scenarios where collection lifecycles may race.

Testing Needed

• Run full proptest suite to validate all new transition states and invariants, especially hard-deletion propagation through collection forks and dependencies.
• Run integration tests to ensure sysdb contains no metadata for hard-deleted collections and that no files or segments remain.
• Validate that cycles and edge cases in collection dependency graphs (e.g., forks with mixed live/dead descendants) result in the correct set of deleted collections.
• Verify soft deletion and eligibility semantics remain correct in both test and mock sysdb implementations.

Code Quality Assessment

rust/sysdb/src/sysdb.rs: API extended cleanly; method naming is consistent. Implementation for 'finish_collection_deletion' and batch queries follows Rust async conventions.

rust/garbage_collector/src/operators/compute_versions_to_delete_from_graph.rs: Logic extended to support soft-deleted collections; retains clear separation of concerns and remains testable.

rust/garbage_collector/tests/proptest_helpers/garbage_collector_reference.rs: Substantial state and transition refactor, strong invariants, and testability improved.

rust/garbage_collector/src/construct_version_graph_orchestrator.rs: Structural improvements with defensive checks; code cleanliness maintained.

others: Internal type wiring and glue code follow Rust conventions. Test and ref code are verbose but structured.

rust/garbage_collector/src/garbage_collector_orchestrator_v2.rs: Substantial new logic, clear structure, good use of error handling and tracing. Functions remain within idiomatic bounds. Watch for error propagation and partial state updates.

Best Practices

Testing:
• Rich property-based/proptest testing with genuine model/reference validation.
• Defensive runtime invariant checks for graph integrity and lifecycle correctness.

Modularization:
• Separation of sysdb, orchestrator, operator, and reference/test code.
• Public APIs and test APIs clearly separated.

Error Handling:
• Consistent error propagation via enums and custom error types.
• Defensive error-to-internal mapping in orchestration.

Documentation:
• Method comment headers and code docstrings present; inline developers notes added on key invariants.

Potential Issues

• If a fork dependency graph is large and deeply nested, the toposort and eligibility checks could be computationally heavier than before.
• Errors or partial failures in sysdb during batch soft delete status queries or hard delete calls may leave collections in inconsistent states unless fully transactional or carefully handled.
• Backward compatibility: Services expecting previous manual delete step must now access sysdb via new APIs for deletion completeness.
• Potential edge cases if a collection's child is restored or recreated between soft-delete and hard-delete steps.

This summary was automatically generated by @propel-code-bot

@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection-sysdb branch from ca845a3 to ccc9b82 Compare May 22, 2025 23:24
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection branch from c815632 to d56d085 Compare May 22, 2025 23:24
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection-sysdb branch from 716f29a to 9356e10 Compare May 23, 2025 22:55
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection branch 3 times, most recently from ed65c55 to 230d5ac Compare May 23, 2025 22:56
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection-sysdb branch from 9356e10 to 8253c18 Compare May 23, 2025 23:36
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection branch from 230d5ac to cf6858c Compare May 23, 2025 23:36
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection-sysdb branch from 8253c18 to e1a6311 Compare May 27, 2025 17:22
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection branch from cf6858c to 4baa298 Compare May 27, 2025 17:22
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection-sysdb branch 2 times, most recently from c85ee20 to c0914e0 Compare May 27, 2025 23:03
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection branch from 4baa298 to 114728f Compare May 27, 2025 23:03
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection-sysdb branch from c0914e0 to b2befca Compare May 27, 2025 23:13
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection branch from 114728f to 24d7092 Compare May 27, 2025 23:13
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection-sysdb branch from b2befca to 950e307 Compare May 28, 2025 00:17
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection branch 2 times, most recently from e949e65 to 0da7dfa Compare May 28, 2025 19:43
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection-sysdb branch from 950e307 to d1c4e49 Compare May 28, 2025 21:14
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection branch from 0da7dfa to 7de8859 Compare May 28, 2025 21:14
@codetheweb codetheweb changed the base branch from feat-gc-hard-delete-collection-sysdb to graphite-base/4605 May 28, 2025 22:08
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection branch from 7de8859 to 719d31b Compare May 28, 2025 22:08
@codetheweb codetheweb force-pushed the graphite-base/4605 branch from d1c4e49 to 3d65425 Compare May 28, 2025 22:08
@graphite-app graphite-app bot changed the base branch from graphite-base/4605 to main May 28, 2025 22:09
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection branch 2 times, most recently from cdc60db to e167434 Compare May 29, 2025 18:50
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection branch from e167434 to 3f81298 Compare May 29, 2025 21:12
})?;

for collection_id in topo.iter().rev() {
let are_all_children_soft_deleted = petgraph::algo::dijkstra(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we iterate in reverse topological order and soft delete the collections, it should be guaranteed that all children are soft deleted in each step right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, because the graph is comprised of both alive and soft deleted collections
we could break out of the loop on the first alive collection we see (which I think is what you're saying?) but then it's possible to miss collections eligible for hard deletion

@codetheweb codetheweb merged commit cdf5e60 into main May 29, 2025
72 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants