-
Notifications
You must be signed in to change notification settings - Fork 1.6k
[ENH] Update Jina embedding function to support all models and configurations #4244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
This stack of pull requests is managed by Graphite. Learn more about stacking. |
5bcd97d
to
bf3a8fa
Compare
1addfda
to
63adc92
Compare
485d540
to
3a630fd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for your PR, much appreciated! I left some comments and looking forward to hear your feedback!
truncate (bool, optional): Whether to truncate the Jina AI API. | ||
Defaults to None. | ||
dimensions (int, optional): The number of dimensions to use for the Jina AI API. | ||
Defaults to None. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
side note, the above configurations (from task to dimensions) is only supported by jina-embeddings-v3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will add validation and doc string updates to inform, thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after thinking about it, we would ideally want this embedding function validation to run server side. ie if we validate the user is only using these arguments with v3, if v4 comes out it with support for these arguments, it would require another PR to update validation. ive tested it with older models, and the API gives a fairly clear error that these arguments are not supported on those models. Thoughts on leaving validation logic as is?
@@ -51,7 +78,7 @@ def __call__(self, input: Documents) -> Embeddings: | |||
Get the embeddings for a list of texts. | |||
|
|||
Args: | |||
input (Documents): A list of texts or images to get embeddings for. | |||
input (Documents): A list of texts to get embeddings for. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a question ,is the current integration support multimodal embeddings? i.e. jina-clip-v1 and jina-clip-v2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right now it doesnt. i can add that as part of the PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update: can't do that right now, since the format for input on multimodal models in jina does not match the existing ones. will have a workaround at a future point, so will keep this PR focused
jinaai_ef = JinaEmbeddingFunction( | ||
api_key="YOUR_API_KEY", | ||
model_name="jina-embeddings-v3", | ||
late_chunking=True, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note, when model being set to jina-embeddings-v3, user is expected to set a task to get optimal performance.
in this example (QA/Retrieval) at indexing time task=retrieval.passage
, at searching time, task=retrieval.query
would be best.
however this might leads to 2 separate JinaEmbeddingFunction instances.. not very elegant, another option is set task=text-matching
, this should also give better result than not offering a task. We need to think about how to handle it better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the same is expected for the jupyter notebook below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea for right now, we persist 1 configuration of embedding function per collection, so its constant between query time and insert time. will update to use text-matching
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good, i think other embedding providers also do the task
trick, e.g. for different task use different encode function, i was wondering how Choma is handling that now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're right that other providers do the task trick. for now, users would have to pass in the embedding function again during get_collection, which is not ideal. we'll have a more elegant solution, ideally with a configuration for query time
a36c3e9
to
572d1a5
Compare
docs/docs.trychroma.com/markdoc/content/integrations/embedding-models/jina-ai.md
Show resolved
Hide resolved
docs/docs.trychroma.com/markdoc/content/integrations/embedding-models/jina-ai.md
Outdated
Show resolved
Hide resolved
2b1b771
to
106b162
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
6e0669e
to
b1202c3
Compare
b1202c3
to
d89d056
Compare
Description of changes
This PR updates the Jina embedding function in both python and typescript to make v3 the default embedding model, and adds support for all configuration attributes that Jina supports.
this includes: late_chunking, task, truncate, dimensions, embedding_type, and normalized
Test plan
How are these changes tested?
pytest
for python,yarn test
for js,cargo test
for rustDocumentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository?