Skip to content

feat: Adding Docling RAG demo #5109

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 26 commits into from
Apr 2, 2025
Merged

feat: Adding Docling RAG demo #5109

merged 26 commits into from
Apr 2, 2025

Conversation

franciscojavierarceo
Copy link
Member

@franciscojavierarceo franciscojavierarceo commented Mar 1, 2025

What this PR does / why we need it:

This PR introduces enhancements to the RAG (Retrieval-Augmented Generation) demo, adds support for PDF transformation with Docling, and updates various components for better feature handling and testing.

  • examples/rag-docling/*

    • Updated the quickstart guide to include Docling and Milvus usage for transforming PDFs and storing/retrieving embeddings.
    • Added a new notebook (docling-demo.ipynb) demonstrating the use of Docling for text extraction from PDFs.
    • Added a new notebook (docling-quickstart.ipynb) showing how to use Feast to ingest and retrieve text data from the online store.
    • Introduced new functions for embedding text and generating chunk IDs.
    • Added support for Docling and Milvus in feature definitions and transformations.
    • Enhanced feature view configurations for PDF handling.
    • Updated the configuration to use Milvus as the online store with vector search capabilities.
  • sdk/python/feast/feature_store.py

    • Enhanced _get_feature_view_and_df_for_online_write to handle feature view transformations differently for singleton and non-singleton cases.
  • sdk/python/tests/unit/online_store/test_online_retrieval.py

    • Uncommented a previously skipped test for local Milvus to ensure it runs in CI.
  • sdk/python/tests/unit/online_store/test_online_writes.py

    • Added a new test class TestOnlineWritesWithTransform to validate PDF handling and feature transformations during online writes.
  • sdk/python/tests/unit/test_on_demand_python_transformation.py

    • Updated the test for docling_transform_docs to handle multiple inputs and validate the transformation logic.

Which issue(s) this PR fixes:

N/A

Misc:

N/A

Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
…t unique chunk-id

Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
@franciscojavierarceo franciscojavierarceo marked this pull request as ready for review April 2, 2025 03:40
@franciscojavierarceo franciscojavierarceo requested a review from a team as a code owner April 2, 2025 03:40
Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Francisco Javier Arceo <[email protected]>
@franciscojavierarceo franciscojavierarceo enabled auto-merge (squash) April 2, 2025 15:15
@franciscojavierarceo franciscojavierarceo merged commit 569404b into master Apr 2, 2025
46 of 47 checks passed
franciscojavierarceo pushed a commit that referenced this pull request Apr 7, 2025
# [0.48.0](v0.47.0...v0.48.0) (2025-04-07)

### Bug Fixes

* Enhance integration logos display and styling in the UI ([#5221](#5221)) ([5799257](5799257))
* Fix space typo in push.md docs ([#5184](#5184)) ([81677b2](81677b2))
* Fixed integration tests for qdrant and milvus ([#5224](#5224)) ([d6b080d](d6b080d))
* Formatting trino ([760ec0e](760ec0e))
* Multiple fixes in retrieval of online documents ([#5168](#5168)) ([66ddd3e](66ddd3e))
* Operator route creation for Feast UI in OpenShift ([e3946b4](e3946b4))
* Remove entity_rows parameter from retrieve_online_documents_v2 call ([#5225](#5225)) ([2a2e304](2a2e304))
* Styling ([#5222](#5222)) ([34c393c](34c393c))
* typo in the chart ([bd3448b](bd3448b))
* Update milvus-quickstart and feature_store.yaml with correct Milvus Config ([#5200](#5200)) ([306acca](306acca))
* Update Qdrant online store paths in repo_config.py ([#5207](#5207)) ([ab35b0b](ab35b0b)), closes [#5206](#5206)
* Update the doc ([#5194](#5194)) ([726464e](726464e))
* Updated the operator-rabc example to test RBAC from a Kubernete pod ([#5147](#5147)) ([d23a1a5](d23a1a5))

### Features

* add `real`(float32) type for trino offline store ([#4749](#4749)) ([0947f96](0947f96))
* Add async DynamoDB timeout and retry configuration ([#5178](#5178)) ([2f3bcf5](2f3bcf5))
* Add CronJob capability to the Operator (feast apply & materialize-incremental) ([#5217](#5217)) ([285c0dc](285c0dc))
* Add RAG tutorial and Use Cases documentation ([#5226](#5226)) ([99f4004](99f4004))
* Added CLI for features, get historical and online features ([#5197](#5197)) ([4ab9f74](4ab9f74))
* Added export support in feast UI ([#5198](#5198)) ([b079553](b079553))
* Added global registry search support in Feast UI ([#5195](#5195)) ([f09ea49](f09ea49))
* Added UI for Features list ([#5192](#5192)) ([cc7fd47](cc7fd47))
* Adding blog on RAG with Milvus ([#5161](#5161)) ([b9e2e6c](b9e2e6c))
* Adding Docling RAG demo ([#5109](#5109)) ([569404b](569404b))
* Allow transformations on writes to output list of entities ([#5209](#5209)) ([955521a](955521a))
* Cache get_any_feature_view results ([#5175](#5175)) ([924b8a3](924b8a3))
* Clickhouse offline store ([#4725](#4725)) ([86794c2](86794c2))
* Enable keyword search for Milvus ([#5199](#5199)) ([ac44967](ac44967))
* Enable transformations on PDFs ([#5172](#5172)) ([3674971](3674971))
* Enable users to use Entity Query as CTE during historical retrieval ([#5202](#5202)) ([fe69eaf](fe69eaf))
* helm support more deployment config ([d575372](d575372))
* Improved CLI file structuring ([#5201](#5201)) ([972ed34](972ed34))
* Kickoff Transformation implementationtransformation code base ([#5181](#5181)) ([0083303](0083303))
* Make keep-alive timeout configurable for async DynamoDB connections ([#5167](#5167)) ([7f3e528](7f3e528))
* Operator mounts the odh-trusted-ca-bundle configmap when deployed on RHOAI or ODH ([d4d7b0d](d4d7b0d))
* Spark Transformation ([#5185](#5185)) ([be3d85c](be3d85c))
j-wine pushed a commit to j-wine/feast that referenced this pull request Jun 7, 2025
* feat: Adding Docling RAG demo

Signed-off-by: Francisco Javier Arceo <[email protected]>

* updated demo

Signed-off-by: Francisco Javier Arceo <[email protected]>

* cleaned up notebook

Signed-off-by: Francisco Javier Arceo <[email protected]>

* adding chunk id

Signed-off-by: Francisco Javier Arceo <[email protected]>

* adding quickstart demo that is WIP and updating docling-demo to export unique chunk-id

Signed-off-by: Francisco Javier Arceo <[email protected]>

* adding current tentative exmaple repo

Signed-off-by: Francisco Javier Arceo <[email protected]>

* adding current temporary work

Signed-off-by: Francisco Javier Arceo <[email protected]>

* updating demo script to rename things

Signed-off-by: Francisco Javier Arceo <[email protected]>

* updated quickstart

Signed-off-by: Francisco Javier Arceo <[email protected]>

* added comment

Signed-off-by: Francisco Javier Arceo <[email protected]>

* checking in progress

Signed-off-by: Francisco Javier Arceo <[email protected]>

* checking in progress for now, still have some issues with vector retrieval

Signed-off-by: Francisco Javier Arceo <[email protected]>

* okay think i have most things working

Signed-off-by: Francisco Javier Arceo <[email protected]>

* removing commenting and unnecessary code

Signed-off-by: Francisco Javier Arceo <[email protected]>

* uploading demo

Signed-off-by: Francisco Javier Arceo <[email protected]>

* uploading other files

Signed-off-by: Francisco Javier Arceo <[email protected]>

* updated repo exaxmple

Signed-off-by: Francisco Javier Arceo <[email protected]>

* checking in current notebook, almost there

Signed-off-by: Francisco Javier Arceo <[email protected]>

* fixed linter

Signed-off-by: Francisco Javier Arceo <[email protected]>

* fixed transformation logic:

Signed-off-by: Francisco Javier Arceo <[email protected]>

* removed print

Signed-off-by: Francisco Javier Arceo <[email protected]>

* added README with description

Signed-off-by: Francisco Javier Arceo <[email protected]>

* removing print

Signed-off-by: Francisco Javier Arceo <[email protected]>

* updating

Signed-off-by: Francisco Javier Arceo <[email protected]>

* updating metadata file

Signed-off-by: Francisco Javier Arceo <[email protected]>

* updated readme and adding dataset

Signed-off-by: Francisco Javier Arceo <[email protected]>

---------

Signed-off-by: Francisco Javier Arceo <[email protected]>
Signed-off-by: Jacob Weinhold <[email protected]>
j-wine pushed a commit to j-wine/feast that referenced this pull request Jun 7, 2025
# [0.48.0](feast-dev/feast@v0.47.0...v0.48.0) (2025-04-07)

### Bug Fixes

* Enhance integration logos display and styling in the UI ([feast-dev#5221](feast-dev#5221)) ([5799257](feast-dev@5799257))
* Fix space typo in push.md docs ([feast-dev#5184](feast-dev#5184)) ([81677b2](feast-dev@81677b2))
* Fixed integration tests for qdrant and milvus ([feast-dev#5224](feast-dev#5224)) ([d6b080d](feast-dev@d6b080d))
* Formatting trino ([760ec0e](feast-dev@760ec0e))
* Multiple fixes in retrieval of online documents ([feast-dev#5168](feast-dev#5168)) ([66ddd3e](feast-dev@66ddd3e))
* Operator route creation for Feast UI in OpenShift ([e3946b4](feast-dev@e3946b4))
* Remove entity_rows parameter from retrieve_online_documents_v2 call ([feast-dev#5225](feast-dev#5225)) ([2a2e304](feast-dev@2a2e304))
* Styling ([feast-dev#5222](feast-dev#5222)) ([34c393c](feast-dev@34c393c))
* typo in the chart ([bd3448b](feast-dev@bd3448b))
* Update milvus-quickstart and feature_store.yaml with correct Milvus Config ([feast-dev#5200](feast-dev#5200)) ([306acca](feast-dev@306acca))
* Update Qdrant online store paths in repo_config.py ([feast-dev#5207](feast-dev#5207)) ([ab35b0b](feast-dev@ab35b0b)), closes [feast-dev#5206](feast-dev#5206)
* Update the doc ([feast-dev#5194](feast-dev#5194)) ([726464e](feast-dev@726464e))
* Updated the operator-rabc example to test RBAC from a Kubernete pod ([feast-dev#5147](feast-dev#5147)) ([d23a1a5](feast-dev@d23a1a5))

### Features

* add `real`(float32) type for trino offline store ([feast-dev#4749](feast-dev#4749)) ([0947f96](feast-dev@0947f96))
* Add async DynamoDB timeout and retry configuration ([feast-dev#5178](feast-dev#5178)) ([2f3bcf5](feast-dev@2f3bcf5))
* Add CronJob capability to the Operator (feast apply & materialize-incremental) ([feast-dev#5217](feast-dev#5217)) ([285c0dc](feast-dev@285c0dc))
* Add RAG tutorial and Use Cases documentation ([feast-dev#5226](feast-dev#5226)) ([99f4004](feast-dev@99f4004))
* Added CLI for features, get historical and online features ([feast-dev#5197](feast-dev#5197)) ([4ab9f74](feast-dev@4ab9f74))
* Added export support in feast UI ([feast-dev#5198](feast-dev#5198)) ([b079553](feast-dev@b079553))
* Added global registry search support in Feast UI ([feast-dev#5195](feast-dev#5195)) ([f09ea49](feast-dev@f09ea49))
* Added UI for Features list ([feast-dev#5192](feast-dev#5192)) ([cc7fd47](feast-dev@cc7fd47))
* Adding blog on RAG with Milvus ([feast-dev#5161](feast-dev#5161)) ([b9e2e6c](feast-dev@b9e2e6c))
* Adding Docling RAG demo ([feast-dev#5109](feast-dev#5109)) ([569404b](feast-dev@569404b))
* Allow transformations on writes to output list of entities ([feast-dev#5209](feast-dev#5209)) ([955521a](feast-dev@955521a))
* Cache get_any_feature_view results ([feast-dev#5175](feast-dev#5175)) ([924b8a3](feast-dev@924b8a3))
* Clickhouse offline store ([feast-dev#4725](feast-dev#4725)) ([86794c2](feast-dev@86794c2))
* Enable keyword search for Milvus ([feast-dev#5199](feast-dev#5199)) ([ac44967](feast-dev@ac44967))
* Enable transformations on PDFs ([feast-dev#5172](feast-dev#5172)) ([3674971](feast-dev@3674971))
* Enable users to use Entity Query as CTE during historical retrieval ([feast-dev#5202](feast-dev#5202)) ([fe69eaf](feast-dev@fe69eaf))
* helm support more deployment config ([d575372](feast-dev@d575372))
* Improved CLI file structuring ([feast-dev#5201](feast-dev#5201)) ([972ed34](feast-dev@972ed34))
* Kickoff Transformation implementationtransformation code base ([feast-dev#5181](feast-dev#5181)) ([0083303](feast-dev@0083303))
* Make keep-alive timeout configurable for async DynamoDB connections ([feast-dev#5167](feast-dev#5167)) ([7f3e528](feast-dev@7f3e528))
* Operator mounts the odh-trusted-ca-bundle configmap when deployed on RHOAI or ODH ([d4d7b0d](feast-dev@d4d7b0d))
* Spark Transformation ([feast-dev#5185](feast-dev#5185)) ([be3d85c](feast-dev@be3d85c))

Signed-off-by: Jacob Weinhold <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants