fix: update docling prediction provider to include word cells #118

samiuc · 2025-06-04T05:11:22Z

DoclingPredictionProvider returned 0 for F1, recall, and precision during evaluation. On investigation, I found that the word cells were not being populated in the SegmentedPage object that OCREvaluator uses to calculate metrics. This PR adds a fix that copies the cells from the original page into the word cells of the parsed_page (SegmentedPage).

Signed-off-by: samiullahchattha <[email protected]>
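
For readers wondering why all three scores collapse to 0 at once, here is a minimal sketch of the arithmetic. It is not OCREvaluator's actual code; the function name `prf` and its count-based inputs are assumptions made for illustration. With no predicted word cells there can be no matches, so precision, recall, and F1 are all 0.

```python
from typing import Tuple


def prf(true_positives: int, num_predicted: int, num_ground_truth: int) -> Tuple[float, float, float]:
    """Compute (precision, recall, F1) from raw counts.

    Hypothetical helper for illustration only; the real evaluator may
    match cells and aggregate scores differently.
    """
    precision = true_positives / num_predicted if num_predicted else 0.0
    recall = true_positives / num_ground_truth if num_ground_truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Empty prediction (no word cells) against 120 ground-truth words.
print(prf(0, 0, 120))  # (0.0, 0.0, 0.0)
```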

mergify · 2025-06-04T05:11:44Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

#approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
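
To check a title locally before pushing, the pattern from the rule above can be run with Python's `re` module. This is only a sketch: the helper name `is_conventional` is made up here, and it assumes that mergify's `~=` operator behaves like a regex search.

```python
import re

# Pattern copied verbatim from the merge protection rule.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


print(is_conventional("fix: update docling prediction provider to include word cells"))  # True
print(is_conventional("Update docling_eval/prediction_providers/docling_provider.py"))   # False
```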

cau-git · 2025-06-04T13:48:01Z

docling_eval/prediction_providers/docling_provider.py

+                f"Page {page.page_no} has no parsed_page, cannot set word_cells."
+            )
+            return page.parsed_page
+        page.parsed_page.word_cells = page.cells


This will inject fake cells into the word_cells of a parsed_page. The only case in which that is currently valid is when the document was fully OCRed, so this is a very partial workaround.

Also, please consider the following:

The standard OCR options in Docling generate line-level cells (TextCellUnit.LINE), not word-level cells, so the two would not be comparable.

The parsed_page is populated by Docling. This code simply overwrites word_cells without checking whether any are already present (they will be present whenever the source PDF has programmatic cells).
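
A minimal sketch of the non-overwriting behaviour asked for above, assuming simplified stand-in types. The `Cell`, `SegmentedPage`, and `Page` dataclasses and the `set_word_cells` function below are hypothetical and are not docling's real classes or the code in this PR. The fallback copy runs only when no word cells exist yet.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical stand-ins for docling's page types, for illustration only.
@dataclass
class Cell:
    text: str


@dataclass
class SegmentedPage:
    word_cells: List[Cell] = field(default_factory=list)


@dataclass
class Page:
    page_no: int
    cells: List[Cell] = field(default_factory=list)
    parsed_page: Optional[SegmentedPage] = None


def set_word_cells(page: Page) -> Optional[SegmentedPage]:
    """Copy page.cells into word_cells only if none are present."""
    if page.parsed_page is None:
        # Nothing to populate; the caller decides how to report this.
        return None
    if not page.parsed_page.word_cells:
        # Fallback only: existing (e.g. programmatic) cells are left untouched.
        page.parsed_page.word_cells = list(page.cells)
    return page.parsed_page
```

Note that this guard addresses only the overwrite concern. If the copied cells are line-level, the mismatch with word-level cells described above still applies.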

cau-git · 2025-06-11T13:17:17Z

This is addressed in docling-project/docling#1745, which covers consistent line-level cells. Word-level cells cannot be easily retrieved across all OCR backends.

cau-git · 2025-06-16T13:33:48Z

@samiuc We have now released a docling version in which the OCR cells are included in the SegmentedPage (parsed_page) output as line-level cells.

github-actions · 2025-06-20T01:54:01Z

✅ DCO Check Passed

Thanks @samiuc, all your commits are properly signed off. 🎉

fix: update docling prediction provider to include word cells

7fd4341

Signed-off-by: samiullahchattha <[email protected]>

fix: missing parsed_page in set_word_cells method

eb8fdd7

Signed-off-by: samiullahchattha <[email protected]>

samiuc requested review from cau-git and PeterStaar-IBM June 4, 2025 05:34

cau-git reviewed Jun 4, 2025

samiuc and others added 3 commits June 4, 2025 08:50

Update docling_eval/prediction_providers/docling_provider.py

e0927c7

Co-authored-by: Christoph Auer <[email protected]> Signed-off-by: samiuc <[email protected]>

Update docling_eval/prediction_providers/docling_provider.py

b61be2e

Co-authored-by: Christoph Auer <[email protected]> Signed-off-by: samiuc <[email protected]>

fix: conditionally populate word_cells in _set_word_cells method

6ae4900

Signed-off-by: samiullahchattha <[email protected]>

samiuc mentioned this pull request Jun 5, 2025

[Bee] Add OCR-derived word cells into SegmentedPage docling-project/docling#1721

Closed

samiullahchattha added 2 commits June 19, 2025 13:07

Merge branch 'main' into sami/fix-docling-prediction-provider

0c80a96

Signed-off-by: samiullahchattha <[email protected]>

feat: Implement smart weighted character distribution for line text p…

89e1621

…rocessing Signed-off-by: samiullahchattha <[email protected]>

samiullahchattha added 3 commits June 20, 2025 11:58

fix: remove redundant field validators

8f206a4

Signed-off-by: samiullahchattha <[email protected]>

refactor: replace BoundingBoxDict with BoundingBox

e6bafb6

Signed-off-by: samiullahchattha <[email protected]>

refactor: update BoundingBox usage in prediction providers

9644ef4

Signed-off-by: samiullahchattha <[email protected]>

samiuc requested a review from cau-git June 20, 2025 20:28
