Skip to content

fix: update docling prediction provider to include word cells #118

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

samiuc
Copy link
Contributor

@samiuc samiuc commented Jun 4, 2025

DoclingPredictionProvider returned 0 F1, Recall, and Precision scores during evaluation and upon investigation, I found out that the word cells were not being populated in the SegmentedPage object that OCREvaluator uses for calculating metrics, so added a fix to copy cells from the original page to the parsed_page (SegmentedPage) word cells object.

Copy link

mergify bot commented Jun 4, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

@samiuc samiuc requested review from cau-git and PeterStaar-IBM June 4, 2025 05:34
}
pred_record.status = res.status

return pred_record

def set_word_cells(self, page: Page):
if page.parsed_page is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is already done on the caller side, through ... for p in res.pages if p.parsed_page. No need to repeat it here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the type checker was throwing error as it still sees parsed_page as Optional[SegmentedPdfPage] and throws the union-attr error. So I had to add this check explicitly

error: Item "None" of "SegmentedPdfPage | None" has no attribute "word_cells"  [union-attr]
            page.parsed_page.word_cells = page.cells
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~

f"Page {page.page_no} has no parsed_page, cannot set word_cells."
)
return page.parsed_page
page.parsed_page.word_cells = page.cells
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to inject fake cells into the word_cells of a parsed_page. The only case in which that is currently a valid case is when the document was fully OCRed, hence this is a very partial workaround.

Also please consider that...

  • The standard OCR options in Docling are generating line-level cells (TextCellUnit.LINE), not word-level cells. So this would not compare.
  • The parsed_page will be populated from Docling, this code would simply overwrite the word_cells without checking if some are present (they will be present whenever the source PDF hat programmatic cells)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants