Skip to content

[ENH] Update Jina embedding function to support all models and configurations #4244

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 17, 2025

Conversation

jairad26
Copy link
Contributor

@jairad26 jairad26 commented Apr 9, 2025

Description of changes

This PR updates the Jina embedding function in both python and typescript to make v3 the default embedding model, and adds support for all configuration attributes that Jina supports.
this includes: late_chunking, task, truncate, dimensions, embedding_type, and normalized

Test plan

How are these changes tested?

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository?

Copy link

github-actions bot commented Apr 9, 2025

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

Copy link
Contributor Author

jairad26 commented Apr 9, 2025

This stack of pull requests is managed by Graphite. Learn more about stacking.

@jairad26 jairad26 marked this pull request as ready for review April 9, 2025 15:28
@jairad26 jairad26 changed the base branch from jai/propogate-api-ef-error to graphite-base/4244 April 9, 2025 15:30
@jairad26 jairad26 force-pushed the jai/update-jina-ef branch from 5bcd97d to bf3a8fa Compare April 9, 2025 15:30
@jairad26 jairad26 force-pushed the graphite-base/4244 branch from 1addfda to 63adc92 Compare April 9, 2025 15:30
@jairad26 jairad26 changed the base branch from graphite-base/4244 to main April 9, 2025 15:30
@jairad26 jairad26 force-pushed the jai/update-jina-ef branch 5 times, most recently from 485d540 to 3a630fd Compare April 9, 2025 20:24
Copy link

@bwanglzu bwanglzu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for your PR, much appreciated! I left some comments and looking forward to hear your feedback!

truncate (bool, optional): Whether to truncate the Jina AI API.
Defaults to None.
dimensions (int, optional): The number of dimensions to use for the Jina AI API.
Defaults to None.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

side note, the above configurations (from task to dimensions) is only supported by jina-embeddings-v3

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will add validation and doc string updates to inform, thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after thinking about it, we would ideally want this embedding function validation to run server side. ie if we validate the user is only using these arguments with v3, if v4 comes out it with support for these arguments, it would require another PR to update validation. ive tested it with older models, and the API gives a fairly clear error that these arguments are not supported on those models. Thoughts on leaving validation logic as is?

@@ -51,7 +78,7 @@ def __call__(self, input: Documents) -> Embeddings:
Get the embeddings for a list of texts.

Args:
input (Documents): A list of texts or images to get embeddings for.
input (Documents): A list of texts to get embeddings for.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a question ,is the current integration support multimodal embeddings? i.e. jina-clip-v1 and jina-clip-v2

Copy link
Contributor Author

@jairad26 jairad26 Apr 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right now it doesnt. i can add that as part of the PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update: can't do that right now, since the format for input on multimodal models in jina does not match the existing ones. will have a workaround at a future point, so will keep this PR focused

Comment on lines 58 to 69
jinaai_ef = JinaEmbeddingFunction(
api_key="YOUR_API_KEY",
model_name="jina-embeddings-v3",
late_chunking=True,
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note, when model being set to jina-embeddings-v3, user is expected to set a task to get optimal performance.

in this example (QA/Retrieval) at indexing time task=retrieval.passage, at searching time, task=retrieval.query would be best.

however this might leads to 2 separate JinaEmbeddingFunction instances.. not very elegant, another option is set task=text-matching, this should also give better result than not offering a task. We need to think about how to handle it better

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the same is expected for the jupyter notebook below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea for right now, we persist 1 configuration of embedding function per collection, so its constant between query time and insert time. will update to use text-matching

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good, i think other embedding providers also do the task trick, e.g. for different task use different encode function, i was wondering how Choma is handling that now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're right that other providers do the task trick. for now, users would have to pass in the embedding function again during get_collection, which is not ideal. we'll have a more elegant solution, ideally with a configuration for query time

@jairad26 jairad26 force-pushed the jai/update-jina-ef branch 2 times, most recently from a36c3e9 to 572d1a5 Compare April 11, 2025 16:24
@jairad26 jairad26 force-pushed the jai/update-jina-ef branch 2 times, most recently from 2b1b771 to 106b162 Compare April 15, 2025 19:01
Copy link

@bwanglzu bwanglzu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@jairad26 jairad26 force-pushed the jai/update-jina-ef branch 2 times, most recently from 6e0669e to b1202c3 Compare April 16, 2025 22:29
@jairad26 jairad26 force-pushed the jai/update-jina-ef branch from b1202c3 to d89d056 Compare April 16, 2025 22:37
@jairad26 jairad26 merged commit 5141898 into main Apr 17, 2025
69 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants