Image-splitting: Enabling higher resolutions through image-splitting #62

Open
@AdonaiVera

I'm planning to work on an image-splitting feature to better support high-resolution inputs. Before diving in, I wanted to share my approach and get your thoughts to make sure it aligns with the repo's design.

My current plan:

  • Move the image-splitting logic to the collator, so we can process batches of images (instead of handling them individually in the dataset class).
  • Update the image processor to accept a list of images and expose do_image_splitting as a parameter.
  • After resizing images for the vision encoder (e.g., max edge = 224), the processor would:
    • Split the images into patches in batch.
    • Include the original image in the batch for reference.
  • In the collator, after patching:
    I looked at how SmolVLM expands prompts using rows/cols and image tokens and noticed that it includes special tokens like <image>, <fake>, and <global>. I plan to follow a similar approach and add these tokens directly in the collator once the image patches are ready. The dataset-level prompt would also include a placeholder for the image, so the collator knows where to inject the token(s).
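
To make the splitting step concrete, here is a rough sketch of what I have in mind. It is not the repo's API: `split_image` and `split_batch` are hypothetical names, NumPy arrays stand in for the real image tensors, and I assume each image has already been resized so that its height and width are multiples of the patch size. The strided downscale for the reference image is only a stand-in for a proper resize.

```python
import numpy as np


def split_image(image: np.ndarray, patch_size: int = 224):
    """Split an (H, W, C) image into a row-major grid of square patches.

    Assumes H and W are already multiples of patch_size.
    Returns (patches, (rows, cols)) with patches shaped
    (rows * cols, patch_size, patch_size, C).
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    rows, cols = h // patch_size, w // patch_size
    patches = (
        image.reshape(rows, patch_size, cols, patch_size, c)
        .swapaxes(1, 2)  # -> (rows, cols, patch_size, patch_size, C)
        .reshape(rows * cols, patch_size, patch_size, c)
    )
    return patches, (rows, cols)


def split_batch(images, patch_size: int = 224):
    """Split every image in a batch and append a downscaled copy of the original.

    Images may have different grids, so results are returned as lists.
    """
    all_patches, grids = [], []
    for image in images:
        patches, (rows, cols) = split_image(image, patch_size)
        # Crude nearest-neighbour downscale to (patch_size, patch_size, C);
        # a real implementation would use the processor's resize instead.
        global_view = image[::rows, ::cols][None]
        all_patches.append(np.concatenate([patches, global_view], axis=0))
        grids.append((rows, cols))
    return all_patches, grids
```

The returned `(rows, cols)` per image is what the collator would need later to expand the prompt.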

A few questions:

  1. Does it make sense to include the splitting logic directly in the collator, or would you recommend handling it elsewhere?
  2. Is it okay to expand prompts using rows/cols and image tokens after patching (in the collator)?
  3. Would it be okay to insert the <image>, <fake>, and <global> tokens directly in the collator after patching (like smolVLM), and just ensure that the dataset prompt includes a placeholder so the collator knows where to inject them?
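
For questions 2 and 3, this is roughly the prompt expansion I am picturing in the collator. Treat it as a sketch under assumptions: `expand_image_placeholder`, the `<image_placeholder>` string, the `<row_r_col_c>` markers, and the exact token layout are all illustrative, and `<fake>` / `<global>` are the shorthand names used above rather than confirmed tokenizer entries.

```python
def expand_image_placeholder(
    prompt: str,
    rows: int,
    cols: int,
    tokens_per_patch: int,
    placeholder: str = "<image_placeholder>",  # hypothetical dataset-level marker
) -> str:
    """Replace the first placeholder with per-patch and global image tokens.

    Each patch contributes a <fake> marker, a row/col tag, and
    tokens_per_patch copies of <image>; the global image goes last.
    """
    if placeholder not in prompt:
        raise ValueError("prompt has no image placeholder")
    parts = []
    for r in range(rows):
        for c in range(cols):
            parts.append(
                f"<fake><row_{r + 1}_col_{c + 1}>" + "<image>" * tokens_per_patch
            )
        parts.append("\n")  # one line of patches per grid row
    parts.append("<fake><global>" + "<image>" * tokens_per_patch + "<fake>")
    return prompt.replace(placeholder, "".join(parts), 1)


# Example: a 2x3 grid with 4 image tokens per patch.
expanded = expand_image_placeholder(
    "Describe <image_placeholder> briefly.", rows=2, cols=3, tokens_per_patch=4
)
```

Doing this in the collator means the dataset only has to emit one placeholder per image, and the number of injected tokens always matches the grid that the splitting step produced.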

Happy to adjust the plan based on your feedback! Just wanted to double-check before I start pushing changes.

Thanks in advance 🙏
