Yume is a long-term project that aims to create an interactive, realistic, and dynamic world through the input of text, images, or videos.
- A distillation recipe for video DiT.
- FramePack-like training code.
- A long-video generation method with DDP/FSDP sampling support.
The code has been tested with Python 3.10.0, CUDA 12.1, and A100 GPUs.
```bash
./env_setup.sh fastvideo
pip install -r requirements.txt
```
You need to run `pip install .` after each code modification. Alternatively, you can copy the modified files directly into your virtual environment. For example, if you modified `wan/image2video.py` and your virtual environment is named `yume`, copy the file to `envs/yume/lib/python3.10/site-packages/wan/image2video.py`.
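As a quick reference, the two options look like this; the copy destination follows the example path above, so adjust it to your own environment layout:

```bash
# Option 1: reinstall the package after editing the source
pip install .

# Option 2: copy the edited file straight into the virtual environment
# (destination path follows the example above; adjust it to your setup)
cp wan/image2video.py envs/yume/lib/python3.10/site-packages/wan/image2video.py
```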
For image-to-video generation, we use `--jpg_dir="./jpg"` to specify the input image directory and `--caption_path="./caption.txt"` to provide the text conditioning inputs, where each line of the caption file corresponds to one generation instance producing a 2-second video.
```bash
# Download the model weights and place them in Path_To_Yume.
bash scripts/inference/sample_jpg.sh
```
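For reference, a minimal input layout might look like the sketch below; the filenames and prompts are purely illustrative assumptions, not files shipped with the repository:

```bash
# Hypothetical inputs (filenames and prompts are illustrative only)
ls ./jpg
# scene_001.jpg  scene_002.jpg

cat ./caption.txt
# A rainy street at night, the camera moving slowly forward.
# A sunlit forest trail, the view panning gently to the left.
# -> each line of caption.txt drives one 2-second clip
```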
We also support generating videos from the example data in `./val`, where `--test_data_dir="./val"` specifies the location of the example data.
```bash
# Download the model weights and place them in Path_To_Yume.
bash scripts/inference/sample.sh
```
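If your example data lives elsewhere, you can point the script at it. This is a minimal sketch assuming `sample.sh` forwards the flag; if it is hard-coded inside the script, edit the script instead:

```bash
# A minimal sketch, assuming sample.sh forwards command-line flags
bash scripts/inference/sample.sh --test_data_dir="./val"
```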
We perform TTS sampling, where `args.sde` controls whether to use SDE-based sampling.
```bash
# Download the model weights and place them in Path_To_Yume.
bash scripts/inference/sample_tts.sh
```
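A minimal sketch of toggling SDE-based sampling is shown below; the command-line form `--sde` is an assumption based on `args.sde`, and the flag may instead need to be set inside `sample_tts.sh`:

```bash
# A minimal sketch, assuming sample_tts.sh exposes args.sde as a flag
bash scripts/inference/sample_tts.sh --sde   # SDE-based sampling
bash scripts/inference/sample_tts.sh         # without the flag, the script's default sampler is used
```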
For optimal results, we recommend keeping Actual distance, Angular change rate (turn speed), and View rotation speed within the range of 0.1 to 10.
Key adjustment guidelines:
- When executing "Camera remains still (·)", reduce the Actual distance value.
- When executing "Person stands still", decrease both the Angular change rate and View rotation speed values.
Note that these parameters (Actual distance, Angular change rate, and View rotation speed) do affect generation results. Alternatively, you can remove these parameters entirely for simpler operation.
For model training, we use `args.MVDT` to launch the MVDT framework, which requires at least 16 A100 GPUs. Loading T5 onto the CPU may help conserve GPU memory. We employ `args.Distil` to enable adversarial distillation.
```bash
# Download the model weights and place them in Path_To_Yume.
bash scripts/finetune/finetune.sh
```
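A minimal sketch of enabling these switches is shown below; the flag forms `--MVDT` and `--Distil` are assumptions based on `args.MVDT` and `args.Distil`, and they may instead need to be set inside `finetune.sh`:

```bash
# A minimal sketch, assuming finetune.sh forwards these flags
bash scripts/finetune/finetune.sh --MVDT            # launch the MVDT framework (>= 16 A100 GPUs)
bash scripts/finetune/finetune.sh --MVDT --Distil   # additionally enable adversarial distillation
```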
Please refer to https://github.com/Lixsp11/sekai-codebase to download the dataset. For the processed data format, refer to `./test_video`.
```
path_to_processed_dataset_folder/
├── Keys_None_Mouse_Down/
│   ├── video_id.mp4
│   └── video_id.txt
├── Keys_None_Mouse_Up/
├── ...
└── Keys_S_Mouse_·/
```
Each provided TXT file records either camera motion control parameters or animation keyframe data, with the following fields:
```
Start Frame: 2       # Starting frame number (the clip starts at frame 2 of the original video)
End Frame: 50        # Ending frame number
Duration: 49 frames  # Total duration
Keys: W              # Keyboard input
Mouse: ↓             # Mouse action
```
In `scripts/finetune/finetune.sh`, `args.root_dir` is `path_to_processed_dataset_folder`, i.e., the full path to the processed Sekai dataset.
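A minimal sketch, assuming `root_dir` is passed on the command line rather than edited inside the script:

```bash
# A minimal sketch: point root_dir at the processed Sekai dataset
bash scripts/finetune/finetune.sh --root_dir="path_to_processed_dataset_folder"
```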
- Dataset processing
- Providing processed datasets
- Code update
- FP8 support
- Better distillation methods
- Model Update
- Quantized and Distilled Models
- Models for 720p Resolution Generation
We welcome all contributions.
We learned from and reused code from the following projects:
If you use Yume for your research, please cite our paper:
```bibtex
@article{mao2025yume,
  title={Yume: An Interactive World Generation Model},
  author={Mao, Xiaofeng and Lin, Shaoheng and Li, Zhen and Li, Chuanhao and Peng, Wenshuo and He, Tong and Pang, Jiangmiao and Chi, Mingmin and Qiao, Yu and Zhang, Kaipeng},
  journal={arXiv preprint arXiv:2507.17744},
  year={2025}
}
```