[ENH] Update Jina embedding function to support all models and configurations #4244

jairad26 · 2025-04-09T15:28:09Z

Description of changes

This PR updates the Jina embedding function in both Python and TypeScript to make jina-embeddings-v3 the default embedding model, and adds support for all configuration attributes that Jina supports.
This includes: late_chunking, task, truncate, dimensions, embedding_type, and normalized.

Test plan

How are these changes tested?

Tests pass locally with pytest for Python, yarn test for JS, and cargo test for Rust.

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository?

github-actions · 2025-04-09T15:28:22Z

jairad26 · 2025-04-09T15:28:29Z

[ENH] Update Jina embedding function to support all models and configurations #4244 👈 (View in Graphite)
main

This stack of pull requests is managed by Graphite. Learn more about stacking.

bwanglzu

Thanks for your PR, much appreciated! I left some comments and look forward to hearing your feedback!

bwanglzu · 2025-04-10T07:11:10Z

chromadb/utils/embedding_functions/jina_embedding_function.py

+            truncate (bool, optional): Whether to truncate the Jina AI API.
+                Defaults to None.
+            dimensions (int, optional): The number of dimensions to use for the Jina AI API.
+                Defaults to None.


Side note: the above configurations (from task to dimensions) are only supported by jina-embeddings-v3.

Will add validation and docstring updates to inform users, thanks.

After thinking about it, we would ideally want this embedding function validation to run server side. I.e., if we validate that the user is only using these arguments with v3 and v4 then comes out with support for these arguments, it would require another PR to update the validation. I've tested it with older models, and the API gives a fairly clear error that these arguments are not supported on those models. Thoughts on leaving the validation logic as is?
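
The approach proposed in this thread can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `build_jina_payload` is a hypothetical helper, and the payload keys (`model`, `input`, and the option names from the PR description) are assumptions about the request shape. It forwards whichever options the caller set without checking the model name, so an unsupported combination would surface as an API error rather than a client-side one.

```python
from typing import Any, Dict, List, Optional


def build_jina_payload(
    texts: List[str],
    model_name: str = "jina-embeddings-v3",
    **options: Optional[Any],
) -> Dict[str, Any]:
    """Hypothetical helper: assemble a request body for an embeddings call.

    No per-model validation happens here. Options left as None are omitted,
    and everything else is forwarded so the server decides what is supported.
    """
    # Assumed key names; they are not taken from the PR diff.
    payload: Dict[str, Any] = {"model": model_name, "input": texts}
    for name, value in options.items():
        if value is not None:
            payload[name] = value
    return payload


payload = build_jina_payload(
    ["hello world"],
    task="text-matching",
    late_chunking=True,
    dimensions=None,  # unset, so it is left out of the payload
)
print(payload)
```

With this shape, supporting a future model that accepts the same options would need no client change, at the cost of only learning about unsupported options once the request is sent.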

bwanglzu · 2025-04-10T07:12:04Z

chromadb/utils/embedding_functions/jina_embedding_function.py

@@ -51,7 +78,7 @@ def __call__(self, input: Documents) -> Embeddings:
        Get the embeddings for a list of texts.

        Args:
-            input (Documents): A list of texts or images to get embeddings for.
+            input (Documents): A list of texts to get embeddings for.


Just a question: does the current integration support multimodal embeddings, i.e. jina-clip-v1 and jina-clip-v2?

Right now it doesn't. I can add that as part of the PR.

Update: can't do that right now, since the input format for multimodal models in Jina does not match the existing ones. Will have a workaround at a future point, so will keep this PR focused.

bwanglzu · 2025-04-10T07:18:22Z

docs/docs.trychroma.com/markdoc/content/integrations/embedding-models/jina-ai.md

+jinaai_ef = JinaEmbeddingFunction(
+                api_key="YOUR_API_KEY",
+                model_name="jina-embeddings-v3",
+                late_chunking=True,
+            )


Note: when the model is set to jina-embeddings-v3, the user is expected to set a task to get optimal performance.

In this example (QA/retrieval), task=retrieval.passage would be best at indexing time and task=retrieval.query at search time.

However, this might lead to two separate JinaEmbeddingFunction instances, which is not very elegant. Another option is to set task=text-matching, which should also give better results than not specifying a task. We need to think about how to handle this better.

The same applies to the Jupyter notebook below.

Yeah, right now we persist one embedding function configuration per collection, so it's constant between query time and insert time. Will update to use text-matching.

Sounds good. I think other embedding providers also do the task trick, e.g. using a different encode function for each task. I was wondering how Chroma is handling that now?

You're right that other providers do the task trick. For now, users would have to pass in the embedding function again during get_collection, which is not ideal. We'll have a more elegant solution, ideally with a configuration for query time.
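
The task choice discussed in this thread can be summarized in a small sketch. `choose_task` and its `single_config` flag are hypothetical names used only for illustration; the task values are the ones mentioned in the comments. It assumes that a single persisted configuration must serve both inserts and queries, while separate configurations can use the retrieval-specific tasks.

```python
from typing import Literal

Phase = Literal["index", "query"]


def choose_task(phase: Phase, single_config: bool = True) -> str:
    """Hypothetical helper: pick a task value for jina-embeddings-v3."""
    if single_config:
        # One embedding function configuration is persisted per collection,
        # so the same task has to serve both inserts and queries.
        return "text-matching"
    # With separate configurations, use the retrieval-specific tasks.
    return "retrieval.passage" if phase == "index" else "retrieval.query"


for phase in ("index", "query"):
    print(phase, choose_task(phase), choose_task(phase, single_config=False))
```

The second branch corresponds to the two-instance setup described earlier in the thread, which is the case a query-time configuration would make unnecessary.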

bwanglzu

LGTM!

jairad26 mentioned this pull request Apr 9, 2025

[CHORE] remove unused page, page_size, and sort args on get #4204

Merged

1 task

jairad26 mentioned this pull request Apr 9, 2025

[CHORE] Propogate error messages correctly to user #4235

Merged

1 task

jairad26 marked this pull request as ready for review April 9, 2025 15:28

jairad26 changed the base branch from jai/propogate-api-ef-error to graphite-base/4244 April 9, 2025 15:30

jairad26 force-pushed the jai/update-jina-ef branch from 5bcd97d to bf3a8fa Compare April 9, 2025 15:30

jairad26 force-pushed the graphite-base/4244 branch from 1addfda to 63adc92 Compare April 9, 2025 15:30

jairad26 changed the base branch from graphite-base/4244 to main April 9, 2025 15:30

jairad26 force-pushed the jai/update-jina-ef branch 5 times, most recently from 485d540 to 3a630fd Compare April 9, 2025 20:24

bwanglzu reviewed Apr 10, 2025

jairad26 force-pushed the jai/update-jina-ef branch 2 times, most recently from a36c3e9 to 572d1a5 Compare April 11, 2025 16:24

HammadB reviewed Apr 11, 2025

docs/docs.trychroma.com/markdoc/content/integrations/embedding-models/jina-ai.md (resolved)

HammadB reviewed Apr 11, 2025

docs/docs.trychroma.com/markdoc/content/integrations/embedding-models/jina-ai.md (outdated, resolved)

jairad26 force-pushed the jai/update-jina-ef branch 2 times, most recently from 2b1b771 to 106b162 Compare April 15, 2025 19:01

bwanglzu reviewed Apr 16, 2025

jairad26 force-pushed the jai/update-jina-ef branch 2 times, most recently from 6e0669e to b1202c3 Compare April 16, 2025 22:29

[ENH] Update Jina embedding function to support all models and config…

d89d056

…urations

jairad26 force-pushed the jai/update-jina-ef branch from b1202c3 to d89d056 Compare April 16, 2025 22:37

jairad26 merged commit 5141898 into main Apr 17, 2025
69 checks passed

adityamaru mentioned this pull request Apr 17, 2025

[BLD]: migrate workflows to Blacksmith #4312

Closed
