I'm planning to work on an image-splitting feature to better support high-resolution inputs. Before diving in, I wanted to share my approach and get your thoughts to make sure it aligns with the repo's design.
My current plan:
- Move the image-splitting logic to the collator, so we can process batches of images (instead of handling them individually in the dataset class).
- Update the image processor to accept a list of images and apply `do_image_splitting` as a param.
- After resizing images for the vision encoder (e.g., max edge = 224), the processor would (rough sketch after this list):
- Split the images into patches in batch.
- Include the original image in the batch for reference.
- In the collator, after patching: I checked how SmolVLM expands prompts using rows/cols and image tokens and noticed that it includes special tokens like `<image>`, `<fake>`, and `<global>`. I was planning to follow a similar approach by adding these tokens directly in the collator once the image patches are ready, and ensuring that the dataset-level prompt also includes a placeholder for the image so the collator knows where to inject the token(s).
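To make the splitting step concrete, here's a rough sketch of what I have in mind. Everything here is illustrative: the function names, the flat 224 grid, and returning the grid shape alongside the patches are my assumptions, not existing APIs in the repo.

```python
import math

from PIL import Image


def split_image(image: Image.Image, patch_size: int = 224):
    """Split one resized image into a grid of patch_size x patch_size crops.

    Returns (patches, rows, cols); the grid shape is kept so the collator
    can expand the prompt later. Edge patches may be smaller than
    patch_size unless the resize step pads to a multiple of it.
    """
    width, height = image.size
    cols = math.ceil(width / patch_size)
    rows = math.ceil(height / patch_size)
    patches = []
    for r in range(rows):
        for c in range(cols):
            box = (
                c * patch_size,
                r * patch_size,
                min((c + 1) * patch_size, width),
                min((r + 1) * patch_size, height),
            )
            patches.append(image.crop(box))
    return patches, rows, cols


def split_batch(images, do_image_splitting: bool = True, patch_size: int = 224):
    """Batch wrapper: split each image and append the original as a global view."""
    out = []
    for img in images:
        patches, rows, cols = (
            split_image(img, patch_size) if do_image_splitting else ([], 0, 0)
        )
        # Keep the full (resized) image last so the model also sees a global view.
        out.append({"patches": patches + [img], "rows": rows, "cols": cols})
    return out
```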
A few questions:
- Does it make sense to include the splitting logic directly in the collator, or would you recommend handling it elsewhere?
- Is it okay to expand prompts using rows/cols and image tokens after patching (in the collator)?
- Would it be okay to insert the `<image>`, `<fake>`, and `<global>` tokens directly in the collator after patching (like SmolVLM), and just ensure that the dataset prompt includes a placeholder so the collator knows where to inject them?
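For the last question, this is roughly the expansion I'd do in the collator. The token names (`<fake>`, `<global>`, the per-patch `<row_i_col_j>` markers) are the shorthand from above plus hypothetical placeholders, not the exact SmolVLM token strings, and the sequence length per patch is made up:

```python
IMAGE_SEQ_LEN = 64  # number of <image> tokens per patch; illustrative value


def expand_image_prompt(rows: int, cols: int) -> str:
    """Expand one image placeholder into per-patch token runs plus a final
    run for the global image, loosely following the SmolVLM-style layout."""
    parts = []
    for r in range(rows):
        for c in range(cols):
            # Hypothetical per-patch position marker, e.g. <row_1_col_1>.
            parts.append(f"<fake><row_{r + 1}_col_{c + 1}>" + "<image>" * IMAGE_SEQ_LEN)
        parts.append("\n")
    parts.append("<fake><global>" + "<image>" * IMAGE_SEQ_LEN + "<fake>")
    return "".join(parts)


def inject_image_tokens(prompt: str, rows: int, cols: int,
                        placeholder: str = "<image_placeholder>") -> str:
    """In the collator: swap the dataset-level placeholder for the expanded
    token sequence once the patch grid is known."""
    return prompt.replace(placeholder, expand_image_prompt(rows, cols))
```

With this, the dataset would only have to emit `<image_placeholder>` in the prompt text, and the collator fills in the real token sequence after patching.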
Happy to adjust the plan based on your feedback! Just wanted to double-check before I start pushing changes.
Thanks in advance 🙏