Create a special column type when it contains PDF bytes or PDF URL #2991

severo · 2024-07-22T09:45:00Z

In that case, we would generate an image (thumbnail of the first page), stored as an asset, to populate /first-rows and /rows and display in the dataset viewer.

asked internally on Slack: https://huggingface.slack.com/archives/C064HCHEJ2H/p1721215883166569 cc @Pleias

lhoestq · 2024-07-22T10:46:12Z

The priority is to have the PDF type detection and thumbnail IMO.

One way to tackle this is to add the PDF type detection in datasets for the bytes case. This way it will be easy to:

- reuse the same logic as audio/image for the viewer
- show the rendering of the first page of the PDF in the Viewer (rendered using e.g. pypdfium2)
- add the "document" modality using the same logic as image/audio
- (later and if there is interest) define DocumentFolder (or PdfFolder)
- (later and if there is interest) support PDFs in WebDataset TAR files
- (later and if there is interest) use a library such as pypdfium2 to handle reading/writing
- (later and if there is interest) help teams/communities with document AI data loading (cc @molbap for viz)
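
For illustration, detecting the bytes case could be sketched as below. This is a minimal sketch, not the logic used in datasets or the viewer: the names `looks_like_pdf` and `column_is_pdf_bytes` are hypothetical, and the check assumes a strict reading of the PDF format in which the file starts with the `%PDF-` header.

```python
from typing import Iterable, Optional

# Every PDF file begins with this header, followed by a version such as "1.7".
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the buffer starts with the PDF header (hypothetical helper)."""
    return data.startswith(PDF_MAGIC)


def column_is_pdf_bytes(values: Iterable[Optional[bytes]]) -> bool:
    """Treat a column as PDF if it has at least one non-null cell
    and every non-null cell looks like a PDF (assumed policy)."""
    seen = False
    for value in values:
        if value is None:
            continue  # null cells do not count for or against
        if not looks_like_pdf(value):
            return False
        seen = True
    return seen
```

A column of `[b"%PDF-1.4 ...", None]` would qualify under this policy, while a column mixing PDF and PNG bytes would not. Rendering the first page (e.g. with pypdfium2) is left out of the sketch.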

Then for the URL case we can extend the image URL detection step in the viewer, but I'm not sure if it's possible to render a thumbnail of a PDF in JS from a URL?
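
A rough sketch of what the URL check might look like is given below. It is an assumption, not the viewer's existing image URL detection (which is not shown in this thread): the function name `looks_like_pdf_url` is hypothetical, and it only inspects the URL string, so a PDF served from a path without a `.pdf` extension would be missed.

```python
from urllib.parse import urlparse


def looks_like_pdf_url(value: str) -> bool:
    """Return True for http(s) URLs whose path ends in .pdf (hypothetical helper)."""
    try:
        parsed = urlparse(value)
    except ValueError:
        # urlparse raises on some malformed inputs, e.g. a broken IPv6 host
        return False
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False
    # Compare on the path only, so query strings and fragments are ignored
    return parsed.path.lower().endswith(".pdf")
```

The check is purely syntactic and makes no network request, which keeps it cheap enough to run over every cell of a string column.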

severo · 2024-07-22T10:47:55Z

> I'm not sure if it's possible to render a thumbnail of a PDF in JS from a URL

good point, we can't do the same here.

severo · 2024-07-22T10:49:41Z

I opened huggingface/datasets#7058

AndreaFrancis · 2025-05-27T19:22:10Z

Also implement stats + filtering on PDF files as suggested here

severo added the "feature request" (Request for a new feature) label Jul 22, 2024

severo added the "blocked-by-upstream" (The issue must be fixed in a dependency) and "P2" (Nice to have) labels Jul 22, 2024

severo mentioned this issue Jul 22, 2024

New feature type: Document huggingface/datasets#7058

Open

AndreaFrancis self-assigned this May 9, 2025

AndreaFrancis removed the "blocked-by-upstream" (The issue must be fixed in a dependency) label May 9, 2025

AndreaFrancis mentioned this issue May 13, 2025

feat: Thumbnail and PDF for new column type #3193

Merged

6 tasks
