-
Notifications
You must be signed in to change notification settings - Fork 6
fix: update docling prediction provider to include word cells #118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: samiullahchattha <[email protected]>
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🔴 Require two reviewer for test updatesThis rule is failing.When test data is updated, we require two reviewers
🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
Signed-off-by: samiullahchattha <[email protected]>
} | ||
pred_record.status = res.status | ||
|
||
return pred_record | ||
|
||
def set_word_cells(self, page: Page): | ||
if page.parsed_page is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check is already done on the caller side, through ... for p in res.pages if p.parsed_page
. No need to repeat it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, the type checker was throwing error as it still sees parsed_page as Optional[SegmentedPdfPage]
and throws the union-attr error. So I had to add this check explicitly
error: Item "None" of "SegmentedPdfPage | None" has no attribute "word_cells" [union-attr]
page.parsed_page.word_cells = page.cells
^~~~~~~~~~~~~~~~~~~~~~~~~~~
f"Page {page.page_no} has no parsed_page, cannot set word_cells." | ||
) | ||
return page.parsed_page | ||
page.parsed_page.word_cells = page.cells |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to inject fake cells into the word_cells of a parsed_page. The only case in which that is currently a valid case is when the document was fully OCRed, hence this is a very partial workaround.
Also please consider that...
- The standard OCR options in Docling are generating line-level cells (TextCellUnit.LINE), not word-level cells. So this would not compare.
- The
parsed_page
will be populated from Docling, this code would simply overwrite the word_cells without checking if some are present (they will be present whenever the source PDF hat programmatic cells)
Co-authored-by: Christoph Auer <[email protected]> Signed-off-by: samiuc <[email protected]>
Co-authored-by: Christoph Auer <[email protected]> Signed-off-by: samiuc <[email protected]>
Signed-off-by: samiullahchattha <[email protected]>
DoclingPredictionProvider returned 0 F1, Recall, and Precision scores during evaluation and upon investigation, I found out that the word cells were not being populated in the SegmentedPage object that OCREvaluator uses for calculating metrics, so added a fix to copy cells from the original page to the parsed_page (SegmentedPage) word cells object.