Perceive, Understand and Restore: Real-World Image Super-Resolution with Autoregressive Multimodal Generative Models (ICCV 2025)
⭐ If PURE is helpful to your images or projects, please help star this repo. Thanks! 🤗
- 2025.07.07 🎉🎉🎉 Training code is released! 🎉🎉🎉
- 2025.04.11 🎉🎉🎉 Inference code and checkpoints are released! 🎉🎉🎉
- 2025.03.13 🎉🎉🎉 PURE is released! 🎉🎉🎉
Clone the repository and set up the environment:

```bash
git clone https://github.com/nonwhy/PURE.git && cd PURE
conda create -n pure python=3.10 -y
conda activate pure
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
pip install -e .
```
You need to download the VQ-VAE model from LlamaGen and place it at `pure/tokenizer/vq_ds8_c2i.pt`.
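For example, the checkpoint can be fetched programmatically with `huggingface_hub` (a minimal sketch; the `FoundationVision/LlamaGen` repo id is an assumption, adjust it to wherever the checkpoint is actually hosted):

```python
# Minimal sketch: fetch the LlamaGen VQ-VAE checkpoint via huggingface_hub.
# The repo id "FoundationVision/LlamaGen" is an assumption; adjust it if the
# checkpoint is hosted elsewhere.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="FoundationVision/LlamaGen",  # assumed hosting repo
    filename="vq_ds8_c2i.pt",
    local_dir="pure/tokenizer",
)
```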
A minimal example of PURE inference:
```python
from inference_solver import FlexARInferenceSolver
from PIL import Image
from utils.wavelet_color_fix import wavelet_color_fix

# Load the PURE checkpoint.
inference_solver = FlexARInferenceSolver(
    model_path="nonwhy/PURE",
    precision="bf16",
    target_size=512,
)

# Read the low-quality input and resize it to the model's 512x512 working size.
image_path = "/path/to/example_input.png"
image = Image.open(image_path)
if image.size != (512, 512):
    image = image.resize((512, 512), Image.BICUBIC)

# Build the perceive-understand-restore prompt; <|image|> marks the image slot.
q1 = "Perceive the degradation level, understand the image content, and restore the high-quality image. <|image|>"
images = [image]
qas = [[q1, None]]

generated = inference_solver.generate(
    images=images,
    qas=qas,
    max_gen_len=8192,
    temperature=0.9,
    logits_processor=inference_solver.create_logits_processor(
        cfg=0.8,
        text_top_k=1,
    ),
)

# generated[0] is the text response; generated[1] holds the restored image(s).
new_image = generated[1][0]
new_image = wavelet_color_fix(new_image, image)  # match colors to the input
new_image.save("./example_output.png", "PNG")

text = generated[0]
print(text)
```
The final output image resolution is 512x512, so your input image resolution should be 128x128 (4x), 256x256 (2x), or 512x512 (1x). You can adjust the temperature and CFG (classifier-free guidance) parameters to obtain different restoration results.
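To compare settings, a simple sweep over `temperature` and `cfg` can reuse `inference_solver`, `images`, `qas`, and `image` from the example above (the value grids below are illustrative, not recommended settings):

```python
# Illustrative sweep over temperature and CFG; reuses `inference_solver`,
# `images`, `qas`, and `image` from the inference example above. The value
# grids are our own choices, not recommended settings.
import itertools

for temperature, cfg in itertools.product([0.7, 0.9, 1.0], [0.6, 0.8, 1.0]):
    generated = inference_solver.generate(
        images=images,
        qas=qas,
        max_gen_len=8192,
        temperature=temperature,
        logits_processor=inference_solver.create_logits_processor(
            cfg=cfg,
            text_top_k=1,
        ),
    )
    restored = wavelet_color_fix(generated[1][0], image)
    restored.save(f"./example_output_t{temperature}_cfg{cfg}.png", "PNG")
```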
We can seamlessly accelerate inference through Speculative Jacobi Decoding:

```bash
python test_pure_jacobi.py
```
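For intuition, here is a toy sketch of the plain (greedy, non-speculative) Jacobi iteration that Speculative Jacobi Decoding builds on: a block of draft tokens is re-predicted in parallel until it stops changing, at which point it equals the sequential greedy output. The `next_token` stand-in model below is ours, not part of PURE; the actual SJD additionally uses a probabilistic acceptance rule so that sampling-based decoding is accelerated too.

```python
# Toy sketch of greedy (non-speculative) Jacobi decoding for an AR model.
# `next_token` is a dummy stand-in for a real model's greedy next-token call.
def next_token(prefix: list[int]) -> int:
    # Deterministic dummy "model": next token is a hash of the prefix.
    return (sum(prefix) * 31 + len(prefix)) % 100

def jacobi_decode(prompt: list[int], block_len: int = 8, max_iters: int = 50) -> list[int]:
    block = [0] * block_len  # arbitrary initial draft
    for _ in range(max_iters):
        # One Jacobi iteration: re-predict every position from the current draft.
        new_block = [next_token(prompt + block[:i]) for i in range(block_len)]
        if new_block == block:  # fixed point == sequential greedy output
            break
        block = new_block
    return block

print(jacobi_decode([1, 2, 3]))
```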
For training details, please refer to TRAIN.md.
Following SeeSR, we train PURE on LSDIR+FFHQ10k. To generate realistic LQ-HQ image pairs for training, we apply the degradation pipeline from Real-ESRGAN.
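For intuition, here is a minimal sketch of a single Real-ESRGAN-style degradation stage (blur, downsample, noise, JPEG); the actual pipeline chains two randomized stages, and all parameter ranges below are illustrative rather than the exact training configuration:

```python
# Minimal sketch of one Real-ESRGAN-style degradation stage
# (blur -> downsample -> noise -> JPEG). All parameter ranges are
# illustrative; the real pipeline chains two randomized stages.
import cv2
import numpy as np

def degrade(hq_bgr: np.ndarray, scale: int = 4) -> np.ndarray:
    rng = np.random.default_rng()
    # 1. Gaussian blur with a randomly sampled sigma.
    lq = cv2.GaussianBlur(hq_bgr, (21, 21), sigmaX=float(rng.uniform(0.2, 3.0)))
    # 2. Downsample with a randomly chosen interpolation kernel.
    h, w = lq.shape[:2]
    interp = int(rng.choice([cv2.INTER_AREA, cv2.INTER_LINEAR, cv2.INTER_CUBIC]))
    lq = cv2.resize(lq, (w // scale, h // scale), interpolation=interp)
    # 3. Additive Gaussian noise.
    noise = rng.normal(0.0, float(rng.uniform(1.0, 25.0)), lq.shape)
    lq = np.clip(lq.astype(np.float32) + noise, 0, 255).astype(np.uint8)
    # 4. JPEG compression round-trip.
    quality = int(rng.integers(30, 96))
    _, buf = cv2.imencode(".jpg", lq, [int(cv2.IMWRITE_JPEG_QUALITY), quality])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)
```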
| Model | Size | Resolution | Hugging Face |
| --- | --- | --- | --- |
| PURE | 7B | 512 | nonwhy/PURE |
Thanks to the following excellent open-source projects:
- Chameleon: Mixed-Modal Early-Fusion Foundation Models
- Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining
- Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding
This project is released under the MIT License.
If you have any questions, please feel free to contact: [email protected]
If PURE helps your research or work, please consider citing our paper:
```bibtex
@misc{wei2025perceiveunderstandrestorerealworld,
  title={Perceive, Understand and Restore: Real-World Image Super-Resolution with Autoregressive Multimodal Generative Models},
  author={Hongyang Wei and Shuaizheng Liu and Chun Yuan and Lei Zhang},
  year={2025},
  eprint={2503.11073},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.11073},
}
```