I'm planning to work on an image-splitting feature to better support high-resolution inputs. Before diving in, I wanted to share my approach and get your thoughts to make sure it aligns with the repo's design.
My current plan:
- Move the image-splitting logic to the collator, so we can process batches of images (instead of handling them individually in the dataset class).
- Update the image processor to accept a list of images and apply `do_image_splitting` as a param.
- After resizing images for the vision encoder (e.g., max edge = 224), the processor would (rough sketch after this list):
- Split the images into patches in batch.
- Include the original image in the batch for reference.
- In the collator, after patching: I checked how SmolVLM expands prompts using rows/cols and image tokens and noticed that it includes special tokens like `<image>`, `<fake>`, and `<global>`. I was planning to follow a similar approach by adding these tokens directly in the collator once the image patches are ready, and ensuring that the dataset-level prompt also includes a placeholder for the image so the collator knows where to inject the token(s).
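To make the splitting step concrete, here's a rough sketch of what I have in mind. Everything here is illustrative: the function names, the flat 224 grid, and returning the grid shape alongside the patches are my assumptions, not existing APIs in the repo.

```python
import math

from PIL import Image


def split_image(image: Image.Image, patch_size: int = 224):
    """Split one resized image into a grid of patch_size x patch_size crops.

    Returns (patches, rows, cols); the grid shape is kept so the collator
    can expand the prompt later. Edge patches may be smaller than
    patch_size unless the resize step pads to a multiple of it.
    """
    width, height = image.size
    cols = math.ceil(width / patch_size)
    rows = math.ceil(height / patch_size)
    patches = []
    for r in range(rows):
        for c in range(cols):
            box = (
                c * patch_size,
                r * patch_size,
                min((c + 1) * patch_size, width),
                min((r + 1) * patch_size, height),
            )
            patches.append(image.crop(box))
    return patches, rows, cols


def split_batch(images, do_image_splitting: bool = True, patch_size: int = 224):
    """Batch wrapper: split each image and append the original as a global view."""
    out = []
    for img in images:
        patches, rows, cols = (
            split_image(img, patch_size) if do_image_splitting else ([], 0, 0)
        )
        # Keep the full (resized) image last so the model also sees a global view.
        out.append({"patches": patches + [img], "rows": rows, "cols": cols})
    return out
```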
A few questions:
- Does it make sense to include the splitting logic directly in the collator, or would you recommend handling it elsewhere?
- Is it okay to expand prompts using rows/cols and image tokens after patching (in the collator)?
- Would it be okay to insert the `<image>`, `<fake>`, and `<global>` tokens directly in the collator after patching (like SmolVLM), and just ensure that the dataset prompt includes a placeholder so the collator knows where to inject them?
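For the last question, this is roughly the expansion I'd do in the collator. The token names (`<fake>`, `<global>`, the per-patch `<row_i_col_j>` markers) are the shorthand from above plus hypothetical placeholders, not the exact SmolVLM token strings, and the sequence length per patch is made up:

```python
IMAGE_SEQ_LEN = 64  # number of <image> tokens per patch; illustrative value


def expand_image_prompt(rows: int, cols: int) -> str:
    """Expand one image placeholder into per-patch token runs plus a final
    run for the global image, loosely following the SmolVLM-style layout."""
    parts = []
    for r in range(rows):
        for c in range(cols):
            # Hypothetical per-patch position marker, e.g. <row_1_col_1>.
            parts.append(f"<fake><row_{r + 1}_col_{c + 1}>" + "<image>" * IMAGE_SEQ_LEN)
        parts.append("\n")
    parts.append("<fake><global>" + "<image>" * IMAGE_SEQ_LEN + "<fake>")
    return "".join(parts)


def inject_image_tokens(prompt: str, rows: int, cols: int,
                        placeholder: str = "<image_placeholder>") -> str:
    """In the collator: swap the dataset-level placeholder for the expanded
    token sequence once the patch grid is known."""
    return prompt.replace(placeholder, expand_image_prompt(rows, cols))
```

With this, the dataset would only have to emit `<image_placeholder>` in the prompt text, and the collator fills in the real token sequence after patching.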
Happy to adjust the plan based on your feedback! Just wanted to double-check before I start pushing changes.
Thanks in advance 🙏