The priority is to have the PDF type detection and thumbnail IMO.
One way to tackle this is to add the PDF type detection in datasets for the bytes case. This way it will be easy to:
- reuse the same logic as audio/image for the viewer
- show a rendering of the first page of the PDF in the Viewer (rendered using e.g. pypdfium2)
- add the "document" modality using the same logic as image/audio
- (later and if there is interest) define DocumentFolder (or PdfFolder)
- (later and if there is interest) support PDFs in WebDataset TAR files
- (later and if there is interest) use a library like pypdfium2 to handle reading/writing
- (later and if there is interest) help teams/communities with document AI data loading (cc @molbap for viz)
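To make the bytes case concrete, here is a rough sketch of what detection plus first-page rendering could look like. This is not how `datasets` or the viewer does it today: the function names `looks_like_pdf` and `first_page_thumbnail` are hypothetical, and the 1024-byte search window is an assumption (a lenient check, since some readers tolerate leading junk before the `%PDF-` header). Rendering assumes pypdfium2 (v4-style API) and Pillow are installed.

```python
PDF_MAGIC = b"%PDF-"  # marker that opens the PDF file header


def looks_like_pdf(data: bytes, window: int = 1024) -> bool:
    """Return True if the PDF header marker occurs within the first `window` bytes.

    The window size is an assumption for this sketch; a stricter check
    would be `data.startswith(PDF_MAGIC)`.
    """
    return PDF_MAGIC in data[:window]


def first_page_thumbnail(data: bytes, scale: float = 1.0):
    """Render the first page of a PDF to a PIL image (hypothetical helper).

    pypdfium2 is imported lazily so that detection alone has no extra dependency.
    """
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(data)
    try:
        page = pdf[0]  # first page
        bitmap = page.render(scale=scale)
        # Copy so the image does not depend on the bitmap's buffer after closing.
        return bitmap.to_pil().copy()
    finally:
        pdf.close()
```

The resulting image could then go through the same asset path as existing image cells, e.g. `first_page_thumbnail(cell_bytes).save("thumb.png")` for any cell where `looks_like_pdf(cell_bytes)` is true.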
Then for the URL case we can extend the image URL detection step in the viewer, but I'm not sure if it's possible to render a thumbnail of a PDF in JS from a URL?
In that case, we would generate an image (thumbnail of the first page), stored as an asset, to populate /first-rows and /rows and display in the dataset viewer.
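For the URL case, the detection part could be as simple as the sketch below. It only illustrates the idea and is not the viewer's actual image URL detection: `looks_like_pdf_url` is a hypothetical name, and relying on the `.pdf` path suffix is an assumption (URLs that serve PDFs without that suffix would be missed).

```python
from urllib.parse import urlparse


def looks_like_pdf_url(value: str) -> bool:
    """Heuristic: an http(s) URL whose path ends in .pdf (case-insensitive)."""
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and parsed.path.lower().endswith(".pdf")
```

Query strings and fragments are ignored because only the path component is checked, so `https://example.com/doc.pdf?download=1` still matches.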
asked internally on Slack: https://huggingface.slack.com/archives/C064HCHEJ2H/p1721215883166569 cc @Pleias