It appears that Parquet is doing something clever when compressing vectors, resulting in rather small file sizes for sparse and low-cardinality vectors:
```python
import numpy as np
import pandas as pd
from usearch.index import Index, MetricKind

np.random.seed(42)

file_pq = "test10Kx1024uint8.parquet"
file_usearch = "test10Kx1024uint8.usearch"
max_value = 4  # exclusive upper bound: values fall in [0, max_value)
N = 10_000

# One 1024-dimensional uint8 vector per row.
df = pd.DataFrame({
    "vec": [v for v in np.random.randint(max_value, size=(N, 1024), dtype=np.uint8)]
})
df.to_parquet(file_pq, compression="snappy")

# Store the same vectors in a USearch index.
index = Index(ndim=1024, dtype="i8", metric=MetricKind.L2sq)
index.add(df.index, np.stack(df.vec.values))
index.save(file_usearch)

#!stat --printf '%s %n\n' test*
#
# max_value = 256:
# 10266482 test10Kx1024uint8.parquet
# 11725264 test10Kx1024uint8.usearch
#
# max_value = 4:
# 2584703 test10Kx1024uint8.parquet
# 11725264 test10Kx1024uint8.usearch
```
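A quick sanity check on those numbers, as a sketch rather than a statement about Parquet internals: if each value were stored in only as many bits as its range requires (2 bits for 4 distinct values, 8 bits for 256), the payload sizes would land within about 1% of the Parquet files reported above. That is at least consistent with dictionary encoding plus bit-packing of the dictionary indices, though I have not inspected the files to confirm which encoding was actually used. The helper name `packed_size` is made up for this illustration.

```python
N, NDIM = 10_000, 1024  # same shape as the experiment above


def packed_size(max_value: int) -> int:
    """Bytes needed if every value in [0, max_value) takes the minimum whole number of bits."""
    bits_per_value = max(1, (max_value - 1).bit_length())
    return N * NDIM * bits_per_value // 8


# Parquet file sizes reported above, in bytes.
reported_parquet = {256: 10_266_482, 4: 2_584_703}

for max_value, actual in reported_parquet.items():
    estimate = packed_size(max_value)
    print(f"max_value={max_value}: estimate {estimate:,} B, "
          f"Parquet {actual:,} B, ratio {actual / estimate:.3f}")
```

The USearch file is 11,725,264 bytes in both runs, so its size does not depend on how many distinct values the vectors contain.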
The USearch file is the same size in both cases. Is there an opportunity to tweak the storage format to make it more compact?
Maybe by leveraging Parquet and writing the extra attributes the index needs alongside the vectors?
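To make the idea concrete, here is a minimal sketch of bit-packing low-cardinality vectors in plain Python. It is only an illustration of the kind of saving that might be available: `pack` and `unpack` are hypothetical helpers, not part of USearch or Parquet, and a real format would also need to record the bit width and vector length somewhere.

```python
import random


def pack(values, bits):
    """Pack small non-negative ints into bytes, `bits` bits per value."""
    acc = 0
    for i, v in enumerate(values):
        if v < 0 or v >> bits:
            raise ValueError(f"value {v} does not fit in {bits} bits")
        acc |= v << (i * bits)
    nbytes = (len(values) * bits + 7) // 8
    return acc.to_bytes(nbytes, "little")


def unpack(data, bits, count):
    """Inverse of pack(): recover `count` values of `bits` bits each."""
    acc = int.from_bytes(data, "little")
    mask = (1 << bits) - 1
    return [(acc >> (i * bits)) & mask for i in range(count)]


rng = random.Random(42)
vector = [rng.randrange(4) for _ in range(1024)]  # 4 distinct values -> 2 bits each
packed = pack(vector, bits=2)

print(len(vector), "values ->", len(packed), "bytes")
print("round trip ok:", unpack(packed, 2, len(vector)) == vector)
```

A 1024-dimensional vector with four distinct values fits in 256 bytes this way instead of 1024. Distance computations would have to unpack the values or operate on the packed form, which is the part I am unsure how to fit into the index.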
Can you contribute to the implementation?
I can contribute
Is your feature request specific to a certain interface?
It applies to everything
Contact Details
No response
Is there an existing issue for this?
I have searched the existing issues
Code of Conduct
I agree to follow this project's Code of Conduct