Bug: unsigned uint8 misbehaves when building an index #595

Open
2 of 3 tasks
liquidcarbon opened this issue Apr 22, 2025 · 10 comments
Labels
bug Something isn't working

Comments

@liquidcarbon

liquidcarbon commented Apr 22, 2025

Describe the bug

Why do the index contents and the distance calculations become all zeroes?

Steps to reproduce

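import numpy as np
import pandas as pd
from usearch.index import Index

index = Index(ndim=3)  # default metric and dtype
a = np.uint8([
    [0, 0, 1],
    [0, 1, 2],
    [1, 2, 3],
])
index.add([0, 1, 2], a)
for i in range(3):
    print(index[i])  # stored vectors come back altered
pd.DataFrame([r for r in index.search(a, 4)])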

Image

Expected behavior

If you do this with DuckDB:

import duckdb

# Reuses `a` and `pd` from the snippet above
df = pd.DataFrame({"idx": [0, 1, 2], "vec": [v for v in a]})
duckdb.sql("""
SELECT a.idx, b.idx, LIST_DISTANCE(a.vec, b.vec)
FROM df a JOIN df b ON 1=1
""").df()

Image
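
For reference, the distances the DuckDB query is expected to report can be worked out by hand. This is a minimal sketch, assuming LIST_DISTANCE is the plain Euclidean distance; the helper name pairwise_euclidean is made up for illustration and is not part of DuckDB or usearch.

```python
import math

# The three vectors from the reproduction snippet
vectors = [
    [0, 0, 1],
    [0, 1, 2],
    [1, 2, 3],
]


def pairwise_euclidean(vecs):
    """Return {(i, j): distance} for every ordered pair, like the cross join."""
    return {
        (i, j): math.dist(u, v)
        for i, u in enumerate(vecs)
        for j, v in enumerate(vecs)
    }


distances = pairwise_euclidean(vectors)
for (i, j), d in sorted(distances.items()):
    print(i, j, round(d, 4))
```

The diagonal is zero, and the off-diagonal values are sqrt(2) for vectors 0 and 1, exactly 3 for vectors 0 and 2, and sqrt(3) for vectors 1 and 2.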

USearch version

2.17.7

Operating System

Amazon Linux

Hardware architecture

x86

Which interface are you using?

Python bindings

Contact Details

No response

Are you open to being tagged as a contributor?

  • I am open to being mentioned in the project .git history as a contributor

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct
@liquidcarbon liquidcarbon added the bug Something isn't working label Apr 22, 2025
@ashvardanian
Contributor

@liquidcarbon hey! Try explicitly setting the preferred metric and internal representation type in the constructor of the index 🤗

@liquidcarbon
Author

liquidcarbon commented Apr 22, 2025

I've tried a few things; neither dtype nor metric seems to make a difference.

Image

Looks like some rescaling is happening here:

Image

@ashvardanian
Contributor

Is it the same for types like f32 and f16?

@liquidcarbon
Author

Yes

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 1., 0.]], dtype=float32)

@liquidcarbon
Author

liquidcarbon commented Apr 22, 2025

Additional context: I have a large Parquet dataset with a vector column stored as 1024-dim np.uint8 vectors, in which typically around 50-100 values are non-zero.

I was trying to build an index with usearch, and the search results didn't make sense. Then I noticed that only a few (under 10) non-zero values remained in the vectors stored in the index.

Amazon Linux 2023.6.20241010; r7i.large instance, if this helps

@liquidcarbon
Author

The reason for uint8 was to use feature counts; I have no intuition whether using counts is any better than using bits (which seems to be the go-to method). But I figured one can always turn uint counts into bits, but not the other way around.
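
Going from counts to bits is a one-way thresholding step. Here is a minimal pure-Python sketch of it; the helper names counts_to_bits and pack_bits are hypothetical, and with NumPy the same packing can be done with np.packbits(a > 0).

```python
def counts_to_bits(counts):
    """Threshold feature counts: any non-zero count becomes 1."""
    return [1 if c > 0 else 0 for c in counts]


def pack_bits(bits):
    """Pack 0/1 values into bytes, most significant bit first, zero-padded."""
    padded = list(bits) + [0] * (-len(bits) % 8)
    packed = bytearray()
    for start in range(0, len(padded), 8):
        byte = 0
        for bit in padded[start:start + 8]:
            byte = (byte << 1) | bit
        packed.append(byte)
    return bytes(packed)


counts = [0, 3, 0, 1, 0, 0, 7, 0, 2]
bits = counts_to_bits(counts)
packed = pack_bits(bits)
print(bits)          # [0, 1, 0, 1, 0, 0, 1, 0, 1]
print(packed.hex())  # 5280
```

The counts 3, 1, 7 and 2 all collapse to 1, which is why the original counts cannot be recovered from the bits.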

@liquidcarbon
Author

Fun fact: uint8 causes trouble but int8 works

@liquidcarbon liquidcarbon changed the title Bug: why are vector values altered when building index? Bug: unit8 misbehaves when building an index Apr 22, 2025
@liquidcarbon liquidcarbon changed the title Bug: unit8 misbehaves when building an index Bug: unsigned uint8 misbehaves when building an index Apr 22, 2025
@ashvardanian
Contributor

That's a good hint, @liquidcarbon! The u8 support was added somewhat recently, if I remember correctly, and some of the tests were not extended to cover it. Would you be able to extend the existing test_index.py tests for i8 to also have a u8 variant, and PR it?
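
A u8 variant of such a test would mostly need to check that vectors come back unchanged after being added. Below is a rough sketch of that check, not taken from test_index.py: assert_round_trip and InMemoryIndex are hypothetical, and the stand-in class only mimics the add and index[key] calls used in the reproduction so the snippet runs without usearch.

```python
class InMemoryIndex:
    """Hypothetical stand-in exposing only add() and index[key]."""

    def __init__(self):
        self._vectors = {}

    def add(self, keys, vectors):
        for key, vector in zip(keys, vectors):
            self._vectors[key] = list(vector)

    def __getitem__(self, key):
        return self._vectors[key]


def assert_round_trip(index, keys, vectors):
    """Add vectors, then check each one is returned with the same values."""
    index.add(keys, vectors)
    for key, expected in zip(keys, vectors):
        stored = [float(x) for x in index[key]]
        assert stored == [float(x) for x in expected], (key, stored, expected)


# Values span the uint8 range, including ones above the int8 maximum of 127
keys = [0, 1, 2]
vectors = [[0, 0, 1], [0, 128, 255], [1, 2, 3]]
assert_round_trip(InMemoryIndex(), keys, vectors)
```

In a real test the stand-in would be replaced by a usearch Index built with an explicit dtype. Whether exact equality is the right expectation is an open question, since the thread reports what looks like rescaling of stored values.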

@liquidcarbon
Author

I'll take a look, but if the root cause is on the C side I must bow out :)

@ashvardanian
Contributor

I'll take over the C patches, but having it covered with tests on the Python side will be a good starting point for me. Thanks, @liquidcarbon!
