```bash
conda create -n proxyv python=3.10 -y
conda activate proxyv
pip install --upgrade pip  # Enable PEP 660 support.
pip install -e ".[train]"
```
For the pre-training stage, we use the 1.2M ShareGPT4V data, which can be downloaded at this link. For the fine-tuning stage, we use the public LLaVA-NeXT data, which can be downloaded at this link.
In our current implementation, we adopt the AnyRes strategy. The image features within each crop are flattened in raster order and concatenated crop by crop, similar to the UniRes strategy. We also append a newline separator token after each crop.
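As an illustration of this layout, the sketch below flattens each crop's feature grid in raster (row-major) order and appends a newline separator embedding after every crop. It is a minimal, hypothetical example rather than the repository code: the tensor shapes and the `newline_embed` name are assumptions.

```python
import torch

def flatten_crops_with_newline(crop_features: torch.Tensor,
                               newline_embed: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of the AnyRes crop layout described above.

    crop_features: (num_crops, H, W, C) vision features, one grid per crop.
    newline_embed: (C,) separator embedding appended after each crop (assumed name).
    Returns a (num_crops * (H * W + 1), C) token sequence.
    """
    num_crops, h, w, c = crop_features.shape
    pieces = []
    for crop in crop_features:                     # concatenate crop by crop
        pieces.append(crop.reshape(h * w, c))      # raster (row-major) flattening
        pieces.append(newline_embed.unsqueeze(0))  # newline separator after the crop
    return torch.cat(pieces, dim=0)
```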
To process the vision tokens more conveniently, we pack the tokens in the order [vision tokens; proxy tokens; newline separator tokens; text tokens], and modify `position_ids` and `attention_masks` accordingly so that the original relative order is preserved.
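A rough sketch of this packing is shown below, assuming the per-group embeddings and their original positions are already available. All names and shapes are assumptions for illustration, and the real ProxyV attention pattern for proxy tokens is more involved.

```python
import torch

def pack_tokens(vision, proxy, newline, text,
                vision_pos, proxy_pos, newline_pos, text_pos):
    """Hypothetical sketch: concatenate embeddings in the packed
    [vision; proxy; newline separator; text] order, but keep the position ids
    of the original interleaved sequence so the model still sees the original
    relative order.
    """
    inputs_embeds = torch.cat([vision, proxy, newline, text], dim=0)
    position_ids = torch.cat([vision_pos, proxy_pos, newline_pos, text_pos], dim=0)

    # Causal mask defined by the *original* positions: token i may attend to
    # token j only if j's original position does not come after i's.
    attention_mask = position_ids.unsqueeze(1) >= position_ids.unsqueeze(0)
    return inputs_embeds, position_ids, attention_mask
```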
The pre-training scripts can be found within the `scripts/pretrain` folder, and fine-tuning example scripts are provided under the `scripts/finetune` folder. To enable ProxyV, set `--proxyv` to `true` in the script and set `--proxyv_start_layer` to the desired layer index.
The vicuna-1.5-7B ProxyV layer-12 model studied in the paper is provided at this link.
A simple inference example script is provided at `demo.py`.
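For reference, a rough inference sketch following the standard LLaVA-NeXT flow is given below. The checkpoint path, conversation template, image file, and question are placeholders, and the loading utilities are assumed to be unchanged in this fork; treat `demo.py` as the authoritative example.

```python
import copy
import torch
from PIL import Image
from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.model.builder import load_pretrained_model

# Hypothetical checkpoint path; replace with the downloaded ProxyV model.
model_path = "checkpoints/vicuna-1.5-7b-proxyv-layer12"
tokenizer, model, image_processor, _ = load_pretrained_model(
    model_path, None, get_model_name_from_path(model_path)
)

image = Image.open("example.jpg")
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [t.to(dtype=torch.float16, device="cuda") for t in image_tensor]

# Build a prompt containing the image placeholder token (vicuna template assumed).
conv = copy.deepcopy(conv_templates["vicuna_v1"])
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nDescribe this image.")
conv.append_message(conv.roles[1], None)
input_ids = tokenizer_image_token(
    conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
).unsqueeze(0).to("cuda")

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        image_sizes=[image.size],
        do_sample=False,
        max_new_tokens=256,
    )
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```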
All benchmark evaluations can be conducted directly using lmms-eval with `--model` set to `llava`.
This project is released under the Apache-2.0 license. See LICENSE for details.
Please consider citing our paper if you find this project helpful for your research:
```bibtex
@article{ProxyV,
  author  = {Wu, Penghao and Lu, Lewei and Liu, Ziwei},
  title   = {Streamline Without Sacrifice - Squeeze out Computation Redundancy in LMM},
  journal = {arXiv preprint arXiv:2505.15816},
  year    = {2025}
}
```
- This work is built upon LLaVA-NeXT.