Yume is a long-term project that aims to create an interactive, realistic, and dynamic world through the input of text, images, or videos.
- A distillation recipe for video DiT.
- FramePack-like training code.
- A long-video generation method with DDP/FSDP sampling support.
The code has been tested with Python 3.10.0, CUDA 12.1, and A100 GPUs.
```bash
./env_setup.sh fastvideo
pip install -r requirements.txt
```
You need to run `pip install .` after each code modification. Alternatively, you can copy the modified files directly into your virtual environment. For example, if you modified `wan/image2video.py` and your virtual environment is named `yume`, copy the file to `envs/yume/lib/python3.10/site-packages/wan/image2video.py`.
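As a quick reference, the two options look like this; the copy destination follows the example path above, so adjust it to your own environment layout:

```bash
# Option 1: reinstall the package after editing the source
pip install .

# Option 2: copy the edited file straight into the virtual environment
# (destination path follows the example above; adjust it to your setup)
cp wan/image2video.py envs/yume/lib/python3.10/site-packages/wan/image2video.py
```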
For image-to-video generation, we use `--jpg_dir="./jpg"` to specify the input image directory and `--caption_path="./caption.txt"` to provide the text conditioning inputs, where each line of the caption file corresponds to one generation instance producing a 2-second video.
```bash
# Download the model weights and place them in Path_To_Yume.
bash scripts/inference/sample_jpg.sh
```
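For reference, a minimal input layout might look like the sketch below; the filenames and prompts are purely illustrative assumptions, not files shipped with the repository:

```bash
# Hypothetical inputs (filenames and prompts are illustrative only)
ls ./jpg
# scene_001.jpg  scene_002.jpg

cat ./caption.txt
# A rainy street at night, the camera moving slowly forward.
# A sunlit forest trail, the view panning gently to the left.
# -> each line of caption.txt drives one 2-second clip
```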
We also support generating videos from the example data in `./val`, where `--test_data_dir="./val"` specifies the location of the example data.
```bash
# Download the model weights and place them in Path_To_Yume.
bash scripts/inference/sample.sh
```
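If your example data lives elsewhere, you can point the script at it. This is a minimal sketch assuming `sample.sh` forwards the flag; if it is hard-coded inside the script, edit the script instead:

```bash
# A minimal sketch, assuming sample.sh forwards command-line flags
bash scripts/inference/sample.sh --test_data_dir="./val"
```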
We perform TTS sampling, where `args.sde` controls whether to use SDE-based sampling.
```bash
# Download the model weights and place them in Path_To_Yume.
bash scripts/inference/sample_tts.sh
```
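A minimal sketch of toggling SDE-based sampling is shown below; the command-line form `--sde` is an assumption based on `args.sde`, and the flag may instead need to be set inside `sample_tts.sh`:

```bash
# A minimal sketch, assuming sample_tts.sh exposes args.sde as a flag
bash scripts/inference/sample_tts.sh --sde   # SDE-based sampling
bash scripts/inference/sample_tts.sh         # without the flag, the script's default sampler is used
```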
For optimal results, we recommend keeping Actual distance, Angular change rate (turn speed), and View rotation speed within the range of 0.1 to 10.
Key adjustment guidelines:
- When executing "Camera remains still (·)", reduce the Actual distance value.
- When executing "Person stands still", decrease both the Angular change rate and View rotation speed values.
Note that these parameters (Actual distance, Angular change rate, and View rotation speed) do affect generation results. Alternatively, you can remove these parameters entirely for simpler operation.
For model training, we use `args.MVDT` to launch the MVDT framework, which requires at least 16 A100 GPUs. Loading T5 onto the CPU may help conserve GPU memory. We employ `args.Distil` to enable adversarial distillation.
```bash
# Download the model weights and place them in Path_To_Yume.
bash scripts/finetune/finetune.sh
```
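A minimal sketch of enabling these switches is shown below; the flag forms `--MVDT` and `--Distil` are assumptions based on `args.MVDT` and `args.Distil`, and they may instead need to be set inside `finetune.sh`:

```bash
# A minimal sketch, assuming finetune.sh forwards these flags
bash scripts/finetune/finetune.sh --MVDT            # launch the MVDT framework (>= 16 A100 GPUs)
bash scripts/finetune/finetune.sh --MVDT --Distil   # additionally enable adversarial distillation
```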
Please refer to https://github.com/Lixsp11/sekai-codebase to download the dataset. For the processed data format, refer to `./test_video`.
```
path_to_processed_dataset_folder/
├── Keys_None_Mouse_Down/
│   ├── video_id.mp4
│   └── video_id.txt
├── Keys_None_Mouse_Up/
├── ...
└── Keys_S_Mouse_·/
```
Each provided TXT file records either camera motion control parameters or animation keyframe data, with the following fields:
```
Start Frame: 2       # Starting frame number (the clip starts at frame 2 of the original video)
End Frame: 50        # Ending frame number
Duration: 49 frames  # Total duration
Keys: W              # Keyboard input
Mouse: ↓             # Mouse action
```
In `scripts/finetune/finetune.sh`, `args.root_dir` is `path_to_processed_dataset_folder`, i.e., the full path to the processed Sekai dataset.
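A minimal sketch, assuming `root_dir` is passed on the command line rather than edited inside the script:

```bash
# A minimal sketch: point root_dir at the processed Sekai dataset
bash scripts/finetune/finetune.sh --root_dir="path_to_processed_dataset_folder"
```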
- Dataset processing
- Providing processed datasets
- Code update
- FP8 support
- Better distillation methods
- Model Update
- Quantized and Distilled Models
- Models for 720p Resolution Generation
We welcome all contributions.
We learned from and reused code from the following projects:
If you use Yume for your research, please cite our paper:
```bibtex
@article{mao2025yume,
  title={Yume: An Interactive World Generation Model},
  author={Mao, Xiaofeng and Lin, Shaoheng and Li, Zhen and Li, Chuanhao and Peng, Wenshuo and He, Tong and Pang, Jiangmiao and Chi, Mingmin and Qiao, Yu and Zhang, Kaipeng},
  journal={arXiv preprint arXiv:2507.17744},
  year={2025}
}
```