[BUG]: don't try reading HNSW files as sparse indices when performing garbage collection on SPANN collections #4401

Merged
merged 1 commit into from
Apr 29, 2025

@codetheweb codetheweb commented Apr 29, 2025

Description of changes

Collections that use an HNSW index have a segment file path called hnsw_index. However, collections with a SPANN index have a segment file path called hnsw_path. Garbage collection currently treats hnsw_index as HNSW files but treats hnsw_path as sparse indices, because the logic never matches on that file type name.

This PR fixes the logic so that both hnsw_index and hnsw_path file types are treated as HNSW files.
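
As an illustration of the classification described above, here is a minimal sketch and not the code from this PR. The diff references a constant named HNSW_PATH; its value "hnsw_path" is assumed from the description. The helper names `is_hnsw_file_type` and `partition_file_paths` are hypothetical, as is the use of a plain `HashMap` for the file paths.

```rust
use std::collections::HashMap;

/// File type key used by collections with an HNSW index (from the PR description).
const HNSW_INDEX: &str = "hnsw_index";
/// File type key used by collections with a SPANN index.
/// Assumption: the real HNSW_PATH constant holds this value.
const HNSW_PATH: &str = "hnsw_path";

/// Hypothetical helper: true when the file type holds HNSW files
/// rather than sparse indices.
fn is_hnsw_file_type(file_type: &str) -> bool {
    file_type == HNSW_INDEX || file_type == HNSW_PATH
}

/// Hypothetical helper: splits a segment's file paths into
/// (HNSW files, sparse index files). Both lists are sorted so the
/// result does not depend on HashMap iteration order.
fn partition_file_paths(
    file_paths: &HashMap<String, Vec<String>>,
) -> (Vec<String>, Vec<String>) {
    let mut hnsw = Vec::new();
    let mut sparse = Vec::new();
    for (file_type, paths) in file_paths {
        if is_hnsw_file_type(file_type) {
            hnsw.extend(paths.iter().cloned());
        } else {
            sparse.extend(paths.iter().cloned());
        }
    }
    hnsw.sort();
    sparse.sort();
    (hnsw, sparse)
}
```

Before the fix, the equivalent of `is_hnsw_file_type` only matched "hnsw_index", so paths stored under "hnsw_path" fell through to the sparse index branch.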

Test plan

How are these changes tested?

Added a test that fails on main.

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?

n/a

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality (Readability, Modularity, Intuitiveness)?

@codetheweb codetheweb marked this pull request as ready for review April 29, 2025 20:38
@codetheweb codetheweb requested a review from Sicheng-Pan April 29, 2025 20:38
@@ -73,7 +74,7 @@ impl ComputeUnusedFilesOperator {
         for segment_compaction_info in older_segment_info.segment_compaction_info.iter() {
             for (file_type, file_paths) in &segment_compaction_info.file_paths {
                 // For hnsw_index files, just add it without comparing with newer version.
-                if file_type == "hnsw_index" {
+                if file_type == "hnsw_index" || file_type == HNSW_PATH {
nit: Should we create an enum and use pattern matching here instead? That could be more robust if we introduce more file types in the future.

I like this suggestion!

I spent a while investigating. This is not straightforward to do as we would need to push the enum into the proto and/or add a wrapper type for CollectionSegmentInfo.

Alternatively, adding just a SegmentFileType enum and then manually constructing it here works, but slightly decreases code quality everywhere else (we now have an enum but it's almost always consumed as SegmentFileType::_.as_str().to_string()) and is really only a stopgap solution.

Ok to file a ticket to add a file type enum to the proto and merge for now?
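
For context, the stopgap variant discussed in this thread might look roughly like the sketch below. Everything in it is hypothetical: the merged change keeps the string comparison, and the enum name `SegmentFileType` comes only from the comment above. The variant names, `from_key`, `is_hnsw`, and the catch-all `Other` variant are assumptions made for illustration.

```rust
/// Hypothetical enum for segment file types; not part of the merged change.
#[derive(Debug, Clone, PartialEq, Eq)]
enum SegmentFileType {
    /// "hnsw_index": used by collections with an HNSW index.
    HnswIndex,
    /// "hnsw_path": used by collections with a SPANN index.
    HnswPath,
    /// Any other file type key, kept verbatim.
    Other(String),
}

impl SegmentFileType {
    /// Manually constructs the enum from the string key stored in the
    /// segment's file path map.
    fn from_key(key: &str) -> Self {
        match key {
            "hnsw_index" => SegmentFileType::HnswIndex,
            "hnsw_path" => SegmentFileType::HnswPath,
            other => SegmentFileType::Other(other.to_string()),
        }
    }

    /// Converts back to the string key, which is how most callers
    /// would still consume the value.
    fn as_str(&self) -> &str {
        match self {
            SegmentFileType::HnswIndex => "hnsw_index",
            SegmentFileType::HnswPath => "hnsw_path",
            SegmentFileType::Other(key) => key.as_str(),
        }
    }

    /// True for file types that hold HNSW files.
    fn is_hnsw(&self) -> bool {
        match self {
            SegmentFileType::HnswIndex | SegmentFileType::HnswPath => true,
            SegmentFileType::Other(_) => false,
        }
    }
}
```

The `Other` variant keeps the sketch total over arbitrary string keys, at the cost of weakening exhaustiveness checking. A closed enum would need the type pushed into the proto, as the comment above points out.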

@codetheweb codetheweb merged commit 12a633c into main Apr 29, 2025
70 checks passed
@codetheweb codetheweb deleted the fix-gc-dont-treat-hnsw-files-as-sparse-indices branch April 29, 2025 23:28
warpbuild-benchmark-bot bot added a commit to WarpBuilds/chroma that referenced this pull request Apr 29, 2025
chroma-droid pushed a commit that referenced this pull request Apr 29, 2025
… garbage collection on SPANN collections (#4401)

eculver pushed a commit that referenced this pull request Apr 30, 2025
This PR cherry-picks the commit 12a633c
onto release/2025-04-25. If there are unresolved conflicts, please
resolve them manually.

Co-authored-by: Max Isom <[email protected]>
3 participants