-
Notifications
You must be signed in to change notification settings - Fork 1.6k
[BUG]: don't try reading HNSW files as sparse indices when performing garbage collection on SPANN collections #4401
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG]: don't try reading HNSW files as sparse indices when performing garbage collection on SPANN collections #4401
Conversation
… garbage collection on SPANN collections
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
@@ -73,7 +74,7 @@ impl ComputeUnusedFilesOperator { | |||
for segment_compaction_info in older_segment_info.segment_compaction_info.iter() { | |||
for (file_type, file_paths) in &segment_compaction_info.file_paths { | |||
// For hnsw_index files, just add it without comparing with newer version. | |||
if file_type == "hnsw_index" { | |||
if file_type == "hnsw_index" || file_type == HNSW_PATH { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Should we create a enum and use pattern matching here instead? That could be more robust in the future if we introduce more files in the future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this suggestion!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I spent a while investigating. This is not straightforward to do as we would need to push the enum into the proto and/or add a wrapper type for CollectionSegmentInfo
.
Alternatively, adding just a SegmentFileType
enum and then manually constructing it here works, but slightly decreases code quality everywhere else (we now have an enum but it's almost always consumed as SegmentFileType::_.as_str().to_string()
) and is really only a stopgap solution.
Ok to file a ticket to add a file type enum to the proto and merge for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created a ticket and merging.
https://github.com/chroma-core/hosted-chroma/issues/2508
… garbage collection on SPANN collections (chroma-core#4401)
… garbage collection on SPANN collections (#4401) ## Description of changes Collections that use a HNSW index have a segment file path called `hnsw_index`. However, collections with a SPANN index have a segment file path called `hnsw_path`. Currently garbage collection logic treats `hnsw_index` as HNSW files but was treating `hnsw_path` as sparse indices because we never match on that file type name. This PR fixes the logic so that both `hnsw_index` and `hnsw_path` file types are treated as HNSW files. ## Test plan _How are these changes tested?_ Added test fails on main. ## Documentation Changes _Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the [docs section](https://github.com/chroma-core/chroma/tree/main/docs/docs.trychroma.com)?_ n/a
This PR cherry-picks the commit 12a633c onto release/2025-04-25. If there are unresolved conflicts, please resolve them manually. Co-authored-by: Max Isom <[email protected]>
Description of changes
Collections that use a HNSW index have a segment file path called
hnsw_index
. However, collections with a SPANN index have a segment file path calledhnsw_path
. Currently garbage collection logic treatshnsw_index
as HNSW files but was treatinghnsw_path
as sparse indices because we never match on that file type name.This PR fixes the logic so that both
hnsw_index
andhnsw_path
file types are treated as HNSW files.Test plan
How are these changes tested?
Added test fails on main.
Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?
n/a