
Commit dbf5d34

zRzRzRzRzRzRzR, a-r-r-o-w, sayakpaul, yiyixuxu, and stevhliu committed
Add CogVideoX text-to-video generation model (#9082)
* add CogVideoX

---------

Co-authored-by: Aryan <[email protected]>
Co-authored-by: sayakpaul <[email protected]>
Co-authored-by: Aryan <[email protected]>
Co-authored-by: yiyixuxu <[email protected]>
Co-authored-by: Steven Liu <[email protected]>
1 parent 871d32e commit dbf5d34

26 files changed: +4,113 −8 lines

docs/source/en/_toctree.yml

Lines changed: 6 additions & 0 deletions
@@ -239,6 +239,8 @@
        title: VQModel
      - local: api/models/autoencoderkl
        title: AutoencoderKL
+     - local: api/models/autoencoderkl_cogvideox
+       title: AutoencoderKLCogVideoX
      - local: api/models/asymmetricautoencoderkl
        title: AsymmetricAutoencoderKL
      - local: api/models/stable_cascade_unet

@@ -263,6 +265,8 @@
        title: FluxTransformer2DModel
      - local: api/models/latte_transformer3d
        title: LatteTransformer3DModel
+     - local: api/models/cogvideox_transformer3d
+       title: CogVideoXTransformer3DModel
      - local: api/models/lumina_nextdit2d
        title: LuminaNextDiT2DModel
      - local: api/models/transformer_temporal

@@ -302,6 +306,8 @@
        title: AutoPipeline
      - local: api/pipelines/blip_diffusion
        title: BLIP-Diffusion
+     - local: api/pipelines/cogvideox
+       title: CogVideoX
      - local: api/pipelines/consistency_models
        title: Consistency Models
      - local: api/pipelines/controlnet

docs/source/en/api/loaders/single_file.md

Lines changed: 2 additions & 0 deletions
@@ -22,6 +22,7 @@ The [`~loaders.FromSingleFileMixin.from_single_file`] method allows you to load:

  ## Supported pipelines

+ - [`CogVideoXPipeline`]
  - [`StableDiffusionPipeline`]
  - [`StableDiffusionImg2ImgPipeline`]
  - [`StableDiffusionInpaintPipeline`]

@@ -49,6 +50,7 @@ The [`~loaders.FromSingleFileMixin.from_single_file`] method allows you to load:
  - [`UNet2DConditionModel`]
  - [`StableCascadeUNet`]
  - [`AutoencoderKL`]
+ - [`AutoencoderKLCogVideoX`]
  - [`ControlNetModel`]
  - [`SD3Transformer2DModel`]
  - [`FluxTransformer2DModel`]
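
Given the entries added above, single-file loading for the new CogVideoX classes might look roughly like the sketch below. The checkpoint paths are hypothetical placeholders, not files referenced by this commit.

```python
import torch

from diffusers import AutoencoderKLCogVideoX, CogVideoXPipeline

# Hypothetical local checkpoint paths -- substitute real single-file checkpoints.
vae = AutoencoderKLCogVideoX.from_single_file(
    "./checkpoints/cogvideox_vae.safetensors", torch_dtype=torch.float16
)
pipe = CogVideoXPipeline.from_single_file(
    "./checkpoints/cogvideox_2b.safetensors", torch_dtype=torch.float16
)
```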
docs/source/en/api/models/autoencoderkl_cogvideox.md

Lines changed: 37 additions & 0 deletions

@@ -0,0 +1,37 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# AutoencoderKLCogVideoX

The 3D variational autoencoder (VAE) model with KL loss used in [CogVideoX](https://github.com/THUDM/CogVideo) was introduced in [CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf) by Tsinghua University & ZhipuAI.

The model can be loaded with the following code snippet.

```python
import torch

from diffusers import AutoencoderKLCogVideoX

vae = AutoencoderKLCogVideoX.from_pretrained("THUDM/CogVideoX-2b", subfolder="vae", torch_dtype=torch.float16).to("cuda")
```
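
The `encode`/`decode` methods and the output classes documented below can be exercised with a minimal round-trip sketch like the following; the dummy tensor shape and value range are illustrative assumptions, not requirements stated in this commit.

```python
import torch

from diffusers import AutoencoderKLCogVideoX

vae = AutoencoderKLCogVideoX.from_pretrained(
    "THUDM/CogVideoX-2b", subfolder="vae", torch_dtype=torch.float16
).to("cuda")

# Dummy video batch: (batch, channels, frames, height, width), values roughly in [-1, 1].
video = torch.rand(1, 3, 9, 256, 256, dtype=torch.float16, device="cuda") * 2 - 1

with torch.no_grad():
    latents = vae.encode(video).latent_dist.sample()  # spatio-temporally compressed latents
    reconstruction = vae.decode(latents).sample       # back to pixel space

print(latents.shape, reconstruction.shape)
```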

## AutoencoderKLCogVideoX

[[autodoc]] AutoencoderKLCogVideoX
  - decode
  - encode
  - all

## AutoencoderKLOutput

[[autodoc]] models.autoencoders.autoencoder_kl.AutoencoderKLOutput

## DecoderOutput

[[autodoc]] models.autoencoders.vae.DecoderOutput
docs/source/en/api/models/cogvideox_transformer3d.md

Lines changed: 30 additions & 0 deletions

@@ -0,0 +1,30 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# CogVideoXTransformer3DModel

A Diffusion Transformer model for 3D data from [CogVideoX](https://github.com/THUDM/CogVideo) was introduced in [CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf) by Tsinghua University & ZhipuAI.

The model can be loaded with the following code snippet.

```python
import torch

from diffusers import CogVideoXTransformer3DModel

transformer = CogVideoXTransformer3DModel.from_pretrained("THUDM/CogVideoX-2b", subfolder="transformer", torch_dtype=torch.float16).to("cuda")
```
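
As a quick, illustrative sanity check after loading (not part of this commit), you can print the configuration the checkpoint was saved with and count its parameters:

```python
import torch

from diffusers import CogVideoXTransformer3DModel

transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-2b", subfolder="transformer", torch_dtype=torch.float16
)

# The config holds the architecture hyperparameters stored with the checkpoint.
print(dict(transformer.config))

# Total parameter count, in billions.
num_params = sum(p.numel() for p in transformer.parameters())
print(f"{num_params / 1e9:.2f}B parameters")
```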

## CogVideoXTransformer3DModel

[[autodoc]] CogVideoXTransformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput
docs/source/en/api/pipelines/cogvideox.md

Lines changed: 91 additions & 0 deletions

@@ -0,0 +1,91 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-->

# CogVideoX

<!-- TODO: update paper with ArXiv link when ready. -->

[CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf) from Tsinghua University & ZhipuAI.

The abstract from the paper is:

*We introduce CogVideoX, a large-scale diffusion transformer model designed for generating videos based on text prompts. To efficiently model video data, we propose to leverage a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions. To improve the text-video alignment, we propose an expert transformer with the expert adaptive LayerNorm to facilitate the deep fusion between the two modalities. By employing a progressive training technique, CogVideoX is adept at producing coherent, long-duration videos characterized by significant motion. In addition, we develop an effective text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method. It significantly helps enhance the performance of CogVideoX, improving both generation quality and semantic alignment. Results show that CogVideoX demonstrates state-of-the-art performance across both machine metrics and human evaluations. The model weights of CogVideoX-2B are publicly available at https://github.com/THUDM/CogVideo.*

<Tip>

Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

</Tip>

This pipeline was contributed by [zRzRzRzRzRzRzR](https://github.com/zRzRzRzRzRzRzR). The original codebase can be found at [THUDM/CogVideo](https://github.com/THUDM/CogVideo), and the original weights are available under [hf.co/THUDM](https://huggingface.co/THUDM).

## Inference

Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency.

First, load the pipeline and run it once without `torch.compile` to establish a baseline:

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipeline = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b").to("cuda")
prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
    "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
    "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
    "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
    "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
    "atmosphere of this unique musical performance."
)
video = pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=8)
```
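
The model pages added in this commit load the VAE and transformer in `torch.float16`. As an illustrative aside (not part of this commit's example), an analogous reduced-precision load of the full pipeline could look like:

```python
import torch

from diffusers import CogVideoXPipeline

# Illustrative half-precision load. If VRAM is still tight,
# pipeline.enable_model_cpu_offload() (requires accelerate) is an alternative
# to moving the whole pipeline onto the GPU at once.
pipeline = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b", torch_dtype=torch.float16
).to("cuda")
```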

Then change the memory layout of the pipeline's `transformer` and `vae` components to `torch.channels_last`:

```python
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.vae.to(memory_format=torch.channels_last)
```

Finally, compile the components and run inference:

```python
pipeline.transformer = torch.compile(pipeline.transformer)
pipeline.vae.decode = torch.compile(pipeline.vae.decode)

# CogVideoX works very well with long and well-described prompts
prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
video = pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
```

The [benchmark](TODO: link) results on an 80GB A100 machine are:

```
Without torch.compile(): Average inference time: TODO seconds.
With torch.compile(): Average inference time: TODO seconds.
```
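
The average inference times above are still TODO placeholders in this commit. Purely as an illustration of how such an average could be measured (this is not the benchmark script behind those numbers), one option is CUDA event timing with a warm-up run:

```python
import torch

def average_inference_time(pipeline, prompt, runs=3):
    # Warm-up run so torch.compile's one-time compilation cost is excluded.
    pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50)

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    for _ in range(runs):
        pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50)
    end.record()
    torch.cuda.synchronize()

    # elapsed_time returns milliseconds; convert to seconds per run.
    return start.elapsed_time(end) / (1000 * runs)
```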

## CogVideoXPipeline

[[autodoc]] CogVideoXPipeline
  - all
  - __call__

## CogVideoXPipelineOutput

[[autodoc]] pipelines.cogvideo.pipeline_cogvideox.CogVideoXPipelineOutput
