Skip to content

feat: Enable transformations on PDFs #5172

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 29 commits into from
Mar 21, 2025
Merged

Conversation

franciscojavierarceo
Copy link
Member

@franciscojavierarceo franciscojavierarceo commented Mar 21, 2025

What this PR does / why we need it:

This PR updates feature transformations, online store handling, and support for PDFs during transformation.

  • sdk/python/feast/infra/online_stores/milvus_online_store/milvus.py

    • Updated online_read to include a new condition for updating the feature_name_feast_primitive_type_map based on the table.schema.
    • Added support for string_val in the list of value types.
    • Modified error message to correctly reflect the unsupported value type and feature.
  • sdk/python/feast/feature_store.py

    • Enhanced _get_feature_view_and_df_for_online_write to handle feature view transformations differently for singleton and non-singleton cases.
    • Improved handling of transformed data and input data frame creation.
  • sdk/python/feast/on_demand_feature_view.py

    • Added support for PDF_BYTES in the random input construction.
  • sdk/python/feast/transformation/python_transformation.py

    • Improved type inference for nested lists and singleton feature views.
  • sdk/python/feast/types.py

    • Introduced PDF_BYTES to the primitive feast types and updated related mappings and enums.
  • sdk/python/feast/value_type.py

    • Added PDF_BYTES to the ValueType enum.
  • sdk/python/tests/unit/online_store/test_online_retrieval.py

    • Added a new test test_milvus_stored_writes_with_explode to validate storing and retrieving exploded document embeddings with Milvus online store.
    • Renamed test_milvus_lite_get_online_documents_v2 to test_milvus_lite_retrieve_online_documents_v2.
  • sdk/python/tests/unit/test_on_demand_python_transformation.py

    • Added platform and version-specific skip condition for tests.
    • Introduced new tests for PDF handling and document transformation with chunking and embedding.

Which issue(s) this PR fixes:

#5173

Misc:

N/A

Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
…t unique chunk-id

Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
@franciscojavierarceo franciscojavierarceo changed the title Enable pdf transforms feat: Enable transformations on PDFs Mar 21, 2025
Signed-off-by: Francisco Javier Arceo <[email protected]>
@franciscojavierarceo franciscojavierarceo marked this pull request as ready for review March 21, 2025 02:54
@franciscojavierarceo franciscojavierarceo requested a review from a team as a code owner March 21, 2025 02:54
@@ -678,6 +678,9 @@ def _construct_random_input(
) -> dict[str, Union[list[Any], Any]]:
rand_dict_value: dict[ValueType, Union[list[Any], Any]] = {
ValueType.BYTES: [str.encode("hello world")],
ValueType.PDF_BYTES: [
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary to have a new type, maybe just use BYTES?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, running type inference on raw bytes will fail when using a transformation expecting PDF content.

Copy link
Collaborator

@HaoXuAI HaoXuAI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good, just not sure if PDF_BYTES is a primitive value or not.

@franciscojavierarceo
Copy link
Member Author

We need it for validation of ODFV. Using raw bytes fails when you test with PDF otherwise.

@HaoXuAI
Copy link
Collaborator

HaoXuAI commented Mar 21, 2025

We need it for validation of ODFV. Using raw bytes fails when you test with PDF otherwise.

sounds good

@franciscojavierarceo franciscojavierarceo merged commit 3674971 into master Mar 21, 2025
36 of 37 checks passed
franciscojavierarceo pushed a commit that referenced this pull request Apr 7, 2025
# [0.48.0](v0.47.0...v0.48.0) (2025-04-07)

### Bug Fixes

* Enhance integration logos display and styling in the UI ([#5221](#5221)) ([5799257](5799257))
* Fix space typo in push.md docs ([#5184](#5184)) ([81677b2](81677b2))
* Fixed integration tests for qdrant and milvus ([#5224](#5224)) ([d6b080d](d6b080d))
* Formatting trino ([760ec0e](760ec0e))
* Multiple fixes in retrieval of online documents ([#5168](#5168)) ([66ddd3e](66ddd3e))
* Operator route creation for Feast UI in OpenShift ([e3946b4](e3946b4))
* Remove entity_rows parameter from retrieve_online_documents_v2 call ([#5225](#5225)) ([2a2e304](2a2e304))
* Styling ([#5222](#5222)) ([34c393c](34c393c))
* typo in the chart ([bd3448b](bd3448b))
* Update milvus-quickstart and feature_store.yaml with correct Milvus Config ([#5200](#5200)) ([306acca](306acca))
* Update Qdrant online store paths in repo_config.py ([#5207](#5207)) ([ab35b0b](ab35b0b)), closes [#5206](#5206)
* Update the doc ([#5194](#5194)) ([726464e](726464e))
* Updated the operator-rabc example to test RBAC from a Kubernete pod ([#5147](#5147)) ([d23a1a5](d23a1a5))

### Features

* add `real`(float32) type for trino offline store ([#4749](#4749)) ([0947f96](0947f96))
* Add async DynamoDB timeout and retry configuration ([#5178](#5178)) ([2f3bcf5](2f3bcf5))
* Add CronJob capability to the Operator (feast apply & materialize-incremental) ([#5217](#5217)) ([285c0dc](285c0dc))
* Add RAG tutorial and Use Cases documentation ([#5226](#5226)) ([99f4004](99f4004))
* Added CLI for features, get historical and online features ([#5197](#5197)) ([4ab9f74](4ab9f74))
* Added export support in feast UI ([#5198](#5198)) ([b079553](b079553))
* Added global registry search support in Feast UI ([#5195](#5195)) ([f09ea49](f09ea49))
* Added UI for Features list ([#5192](#5192)) ([cc7fd47](cc7fd47))
* Adding blog on RAG with Milvus ([#5161](#5161)) ([b9e2e6c](b9e2e6c))
* Adding Docling RAG demo ([#5109](#5109)) ([569404b](569404b))
* Allow transformations on writes to output list of entities ([#5209](#5209)) ([955521a](955521a))
* Cache get_any_feature_view results ([#5175](#5175)) ([924b8a3](924b8a3))
* Clickhouse offline store ([#4725](#4725)) ([86794c2](86794c2))
* Enable keyword search for Milvus ([#5199](#5199)) ([ac44967](ac44967))
* Enable transformations on PDFs ([#5172](#5172)) ([3674971](3674971))
* Enable users to use Entity Query as CTE during historical retrieval ([#5202](#5202)) ([fe69eaf](fe69eaf))
* helm support more deployment config ([d575372](d575372))
* Improved CLI file structuring ([#5201](#5201)) ([972ed34](972ed34))
* Kickoff Transformation implementationtransformation code base ([#5181](#5181)) ([0083303](0083303))
* Make keep-alive timeout configurable for async DynamoDB connections ([#5167](#5167)) ([7f3e528](7f3e528))
* Operator mounts the odh-trusted-ca-bundle configmap when deployed on RHOAI or ODH ([d4d7b0d](d4d7b0d))
* Spark Transformation ([#5185](#5185)) ([be3d85c](be3d85c))
j-wine pushed a commit to j-wine/feast that referenced this pull request Jun 7, 2025
j-wine pushed a commit to j-wine/feast that referenced this pull request Jun 7, 2025
# [0.48.0](feast-dev/feast@v0.47.0...v0.48.0) (2025-04-07)

### Bug Fixes

* Enhance integration logos display and styling in the UI ([feast-dev#5221](feast-dev#5221)) ([5799257](feast-dev@5799257))
* Fix space typo in push.md docs ([feast-dev#5184](feast-dev#5184)) ([81677b2](feast-dev@81677b2))
* Fixed integration tests for qdrant and milvus ([feast-dev#5224](feast-dev#5224)) ([d6b080d](feast-dev@d6b080d))
* Formatting trino ([760ec0e](feast-dev@760ec0e))
* Multiple fixes in retrieval of online documents ([feast-dev#5168](feast-dev#5168)) ([66ddd3e](feast-dev@66ddd3e))
* Operator route creation for Feast UI in OpenShift ([e3946b4](feast-dev@e3946b4))
* Remove entity_rows parameter from retrieve_online_documents_v2 call ([feast-dev#5225](feast-dev#5225)) ([2a2e304](feast-dev@2a2e304))
* Styling ([feast-dev#5222](feast-dev#5222)) ([34c393c](feast-dev@34c393c))
* typo in the chart ([bd3448b](feast-dev@bd3448b))
* Update milvus-quickstart and feature_store.yaml with correct Milvus Config ([feast-dev#5200](feast-dev#5200)) ([306acca](feast-dev@306acca))
* Update Qdrant online store paths in repo_config.py ([feast-dev#5207](feast-dev#5207)) ([ab35b0b](feast-dev@ab35b0b)), closes [feast-dev#5206](feast-dev#5206)
* Update the doc ([feast-dev#5194](feast-dev#5194)) ([726464e](feast-dev@726464e))
* Updated the operator-rabc example to test RBAC from a Kubernete pod ([feast-dev#5147](feast-dev#5147)) ([d23a1a5](feast-dev@d23a1a5))

### Features

* add `real`(float32) type for trino offline store ([feast-dev#4749](feast-dev#4749)) ([0947f96](feast-dev@0947f96))
* Add async DynamoDB timeout and retry configuration ([feast-dev#5178](feast-dev#5178)) ([2f3bcf5](feast-dev@2f3bcf5))
* Add CronJob capability to the Operator (feast apply & materialize-incremental) ([feast-dev#5217](feast-dev#5217)) ([285c0dc](feast-dev@285c0dc))
* Add RAG tutorial and Use Cases documentation ([feast-dev#5226](feast-dev#5226)) ([99f4004](feast-dev@99f4004))
* Added CLI for features, get historical and online features ([feast-dev#5197](feast-dev#5197)) ([4ab9f74](feast-dev@4ab9f74))
* Added export support in feast UI ([feast-dev#5198](feast-dev#5198)) ([b079553](feast-dev@b079553))
* Added global registry search support in Feast UI ([feast-dev#5195](feast-dev#5195)) ([f09ea49](feast-dev@f09ea49))
* Added UI for Features list ([feast-dev#5192](feast-dev#5192)) ([cc7fd47](feast-dev@cc7fd47))
* Adding blog on RAG with Milvus ([feast-dev#5161](feast-dev#5161)) ([b9e2e6c](feast-dev@b9e2e6c))
* Adding Docling RAG demo ([feast-dev#5109](feast-dev#5109)) ([569404b](feast-dev@569404b))
* Allow transformations on writes to output list of entities ([feast-dev#5209](feast-dev#5209)) ([955521a](feast-dev@955521a))
* Cache get_any_feature_view results ([feast-dev#5175](feast-dev#5175)) ([924b8a3](feast-dev@924b8a3))
* Clickhouse offline store ([feast-dev#4725](feast-dev#4725)) ([86794c2](feast-dev@86794c2))
* Enable keyword search for Milvus ([feast-dev#5199](feast-dev#5199)) ([ac44967](feast-dev@ac44967))
* Enable transformations on PDFs ([feast-dev#5172](feast-dev#5172)) ([3674971](feast-dev@3674971))
* Enable users to use Entity Query as CTE during historical retrieval ([feast-dev#5202](feast-dev#5202)) ([fe69eaf](feast-dev@fe69eaf))
* helm support more deployment config ([d575372](feast-dev@d575372))
* Improved CLI file structuring ([feast-dev#5201](feast-dev#5201)) ([972ed34](feast-dev@972ed34))
* Kickoff Transformation implementationtransformation code base ([feast-dev#5181](feast-dev#5181)) ([0083303](feast-dev@0083303))
* Make keep-alive timeout configurable for async DynamoDB connections ([feast-dev#5167](feast-dev#5167)) ([7f3e528](feast-dev@7f3e528))
* Operator mounts the odh-trusted-ca-bundle configmap when deployed on RHOAI or ODH ([d4d7b0d](feast-dev@d4d7b0d))
* Spark Transformation ([feast-dev#5185](feast-dev#5185)) ([be3d85c](feast-dev@be3d85c))

Signed-off-by: Jacob Weinhold <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants