
Add CogVideoX text-to-video generation model #9082


Merged (107 commits, Aug 7, 2024)

Changes from 95 commits

Commits (107)
c8e5491
Create autoencoder_kl3d.py
zRzRzRzRzRzRzR Jul 30, 2024
c341786
vae draft
zRzRzRzRzRzRzR Jul 30, 2024
bd6efd5
initial draft of cogvideo transformer
a-r-r-o-w Jul 30, 2024
bb91775
add imports
a-r-r-o-w Jul 30, 2024
59e6669
fix attention mask
a-r-r-o-w Jul 30, 2024
45cb1f9
fix layernorms
a-r-r-o-w Jul 30, 2024
84ff56e
fix with some review guide
zRzRzRzRzRzRzR Jul 30, 2024
a3d827f
rename
zRzRzRzRzRzRzR Jul 30, 2024
dc7e6e8
fix error
zRzRzRzRzRzRzR Jul 30, 2024
aff72ec
Update autoencoder_kl3d.py
zRzRzRzRzRzRzR Jul 30, 2024
cb5348a
fix nasty bug in 3d sincos pos embeds
a-r-r-o-w Jul 30, 2024
e982881
refactor
a-r-r-o-w Jul 31, 2024
d963b1a
update conversion script for latest modeling changes
a-r-r-o-w Jul 31, 2024
1696758
remove debug prints
a-r-r-o-w Jul 31, 2024
21a0fc1
make style
a-r-r-o-w Jul 31, 2024
d83c1f8
add workflow to rebase with upstream main nightly.
sayakpaul Jul 29, 2024
dfeb329
add upstream
sayakpaul Jul 29, 2024
71bcb1e
Revert "add workflow to rebase with upstream main nightly."
sayakpaul Jul 29, 2024
0980f4d
add workflow for rebasing with upstream automatically.
sayakpaul Jul 29, 2024
ee40f0e
follow review guide
zRzRzRzRzRzRzR Jul 31, 2024
8fe54bc
add
zRzRzRzRzRzRzR Jul 31, 2024
1c661ce
remove deriving and using nn.module
zRzRzRzRzRzRzR Jul 31, 2024
73b041e
Merge branch 'cogvideox' into cogvideox-common-draft-1
a-r-r-o-w Jul 31, 2024
b305280
add skeleton for pipeline
a-r-r-o-w Jul 31, 2024
6bcafcb
make fix-copies
a-r-r-o-w Jul 31, 2024
ec9508c
Merge branch 'main' into cogvideox-common-draft-2
a-r-r-o-w Jul 31, 2024
3ae9413
undo unnecessary changes added on cogvideo-vae by mistake
a-r-r-o-w Jul 31, 2024
2be7469
groups->norm_num_groups
a-r-r-o-w Jul 31, 2024
9f9d0cb
verify CogVideoXSpatialNorm3D implementation
a-r-r-o-w Jul 31, 2024
c43a8f5
minor factor and repositioning of code in order of invocation
a-r-r-o-w Jul 31, 2024
5f183bf
reorder upsampling/downsampling blocks in order of invocation
a-r-r-o-w Jul 31, 2024
470815c
minor refactor
a-r-r-o-w Jul 31, 2024
e67cc5a
implement encode prompt
a-r-r-o-w Jul 31, 2024
d45d199
make style
a-r-r-o-w Jul 31, 2024
73469f9
make fix-copies
a-r-r-o-w Jul 31, 2024
45f7127
fix bug in handling long prompts
a-r-r-o-w Jul 31, 2024
a449ceb
update conversion script
a-r-r-o-w Jul 31, 2024
4498cfc
add doc draft
zRzRzRzRzRzRzR Jul 31, 2024
2956866
Merge branch 'cogvideox-common-draft-2' of https://github.com/hugging…
zRzRzRzRzRzRzR Jul 31, 2024
bb4740c
add clear_fake_cp_cache
zRzRzRzRzRzRzR Jul 31, 2024
e05f834
refactor vae
a-r-r-o-w Aug 1, 2024
03c28ee
modeling fixes
a-r-r-o-w Aug 1, 2024
712ddbe
make style
a-r-r-o-w Aug 1, 2024
03ee7cd
add pipeline implementation
a-r-r-o-w Aug 1, 2024
a31db5f
using with 226 instead of 225 of final weight
zRzRzRzRzRzRzR Aug 1, 2024
351d1f0
remove 0.transformer_blocks.encoder.embed_tokens.weight
zRzRzRzRzRzRzR Aug 1, 2024
d0b8db2
update
a-r-r-o-w Aug 1, 2024
fe6f5d6
ensure tokenizer config correctly uses 226 as text length
a-r-r-o-w Aug 1, 2024
4c2e887
add cogvideo specific attn processor
a-r-r-o-w Aug 1, 2024
41da084
remove debug prints
a-r-r-o-w Aug 1, 2024
77558f3
add pipeline docs
a-r-r-o-w Aug 1, 2024
e12458e
make style
a-r-r-o-w Aug 1, 2024
c33dd02
remove incorrect copied from
a-r-r-o-w Aug 1, 2024
71e7c82
vae problem fix
zRzRzRzRzRzRzR Aug 2, 2024
ec53a30
schedule
zRzRzRzRzRzRzR Aug 2, 2024
551c884
remove debug prints
a-r-r-o-w Aug 2, 2024
3def905
update
a-r-r-o-w Aug 2, 2024
65f6211
Merge pull request #4 from huggingface/cogvideox-refactor-to-diffusers
a-r-r-o-w Aug 2, 2024
21509aa
fp16 problem
zRzRzRzRzRzRzR Aug 2, 2024
b42b079
fix some comment
zRzRzRzRzRzRzR Aug 3, 2024
477e12b
fix
zRzRzRzRzRzRzR Aug 3, 2024
fd0831c
timestep fix
zRzRzRzRzRzRzR Aug 3, 2024
d99528b
Restore the timesteps parameter
zRzRzRzRzRzRzR Aug 3, 2024
c7ee165
Update downsampling.py
zRzRzRzRzRzRzR Aug 3, 2024
61c6da0
remove chunked ff code; reuse and refactor to support temb directly i…
a-r-r-o-w Aug 3, 2024
fa7fa9c
make inference 2-3x faster (by fixing the bug i introduced) 🚀😎
a-r-r-o-w Aug 3, 2024
6988cc3
new schedule with dpm
zRzRzRzRzRzRzR Aug 4, 2024
ba4223a
remove attenstion mask
zRzRzRzRzRzRzR Aug 4, 2024
312f7dc
apply suggestions from review
a-r-r-o-w Aug 4, 2024
1b1b26b
make style
a-r-r-o-w Aug 4, 2024
ba1855c
add workflow to rebase with upstream main nightly.
sayakpaul Jul 29, 2024
7360ea1
add upstream
sayakpaul Jul 29, 2024
2f1b787
Revert "add workflow to rebase with upstream main nightly."
sayakpaul Jul 29, 2024
90aa8be
add workflow for rebasing with upstream automatically.
sayakpaul Jul 29, 2024
5781e01
Merge branch 'huggingface:main' into main
a-r-r-o-w Aug 4, 2024
92c8c00
make fix-copies
a-r-r-o-w Aug 4, 2024
fd11c0f
Merge branch 'main' into cogvideox-common-draft-2
a-r-r-o-w Aug 4, 2024
03580c0
remove cogvideox-specific attention processor
a-r-r-o-w Aug 4, 2024
01c2dff
update docs
a-r-r-o-w Aug 4, 2024
311845f
update docs
a-r-r-o-w Aug 4, 2024
1b1b737
cogvideox branch
zRzRzRzRzRzRzR Aug 5, 2024
2d9602c
add CogVideoX team, Tsinghua University & ZhipuAI
zRzRzRzRzRzRzR Aug 5, 2024
fb6130f
Merge branch 'cogvideox-common-draft-2' of github.com:huggingface/dif…
zRzRzRzRzRzRzR Aug 5, 2024
511c9ef
merge remote branch
zRzRzRzRzRzRzR Aug 5, 2024
123ecef
Merge branch 'main' into cogvideox-2b
a-r-r-o-w Aug 5, 2024
cf7369d
fix some error
zRzRzRzRzRzRzR Aug 5, 2024
9c6b889
rename unsample and add some docs
zRzRzRzRzRzRzR Aug 5, 2024
22dcceb
messages
zRzRzRzRzRzRzR Aug 5, 2024
e4d65cc
update
yiyixuxu Aug 5, 2024
6f4e60b
Merge branch 'cogvideox-2b' of github.com:zRzRzRzRzRzRzR/diffusers in…
yiyixuxu Aug 5, 2024
70a54a8
use num_frames instead of num_seconds
a-r-r-o-w Aug 5, 2024
b3428ad
Merge branch 'main' into cogvideox-2b
a-r-r-o-w Aug 5, 2024
9a0b906
restore
zRzRzRzRzRzRzR Aug 5, 2024
32da2e7
Update lora_conversion_utils.py
zRzRzRzRzRzRzR Aug 5, 2024
878f609
remove dynamic guidance scale
a-r-r-o-w Aug 5, 2024
de9e0b2
address review comments
a-r-r-o-w Aug 6, 2024
9c086f5
dynamic cfg; fix cfg support
a-r-r-o-w Aug 6, 2024
62d94aa
address review comments
a-r-r-o-w Aug 6, 2024
5e4dd15
update tests
a-r-r-o-w Aug 6, 2024
884ddd0
Merge branch 'main' into cogvideox-2b
a-r-r-o-w Aug 6, 2024
d1c575a
fix docs error
a-r-r-o-w Aug 6, 2024
11224d9
alternative implementation to context parallel cache
a-r-r-o-w Aug 6, 2024
70cea91
Update docs/source/en/api/pipelines/cogvideox.md
yiyixuxu Aug 6, 2024
cbc4d32
remove tiling and slicing until their implementations are complete
a-r-r-o-w Aug 6, 2024
14698d0
Merge branch 'main' into cogvideox-2b
sayakpaul Aug 7, 2024
8be845d
Merge branch 'main' into cogvideox-2b
sayakpaul Aug 7, 2024
827a70a
Apply suggestions from code review
sayakpaul Aug 7, 2024
2 changes: 2 additions & 0 deletions docs/source/en/api/loaders/single_file.md
@@ -22,6 +22,7 @@ The [`~loaders.FromSingleFileMixin.from_single_file`] method allows you to load:

## Supported pipelines

- [`CogVideoXPipeline`]
- [`StableDiffusionPipeline`]
- [`StableDiffusionImg2ImgPipeline`]
- [`StableDiffusionInpaintPipeline`]
@@ -49,6 +50,7 @@ The [`~loaders.FromSingleFileMixin.from_single_file`] method allows you to load:
- [`UNet2DConditionModel`]
- [`StableCascadeUNet`]
- [`AutoencoderKL`]
- [`AutoencoderKLCogVideoX`]
- [`ControlNetModel`]
- [`SD3Transformer2DModel`]

69 changes: 69 additions & 0 deletions docs/source/en/api/models/autoencoderkl_cogvideox.md
@@ -0,0 +1,69 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# AutoencoderKLCogVideoX

The 3D variational autoencoder (VAE) model with KL loss used in CogVideoX.
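As a rough sketch of the compression this VAE performs, the latent-shape arithmetic can be written out as below. The compression factors used here (4x temporal, 8x spatial, 16 latent channels) and the causal handling of the first frame are assumptions based on the CogVideoX-2b configuration; check the model config for the authoritative values.

```python
# Sketch of the latent-shape arithmetic for a causal 3D VAE, assuming
# CogVideoX-2b-style factors (not read from a config): 4x temporal
# compression, 8x spatial compression, 16 latent channels. The first
# frame is encoded on its own (causal convolution), hence (frames - 1).

def latent_shape(num_frames: int, height: int, width: int,
                 temporal_compression: int = 4,
                 spatial_compression: int = 8,
                 latent_channels: int = 16):
    latent_frames = (num_frames - 1) // temporal_compression + 1
    return (latent_channels, latent_frames,
            height // spatial_compression, width // spatial_compression)

# 49 frames at 480x720, the resolution CogVideoX-2b generates
print(latent_shape(49, 480, 720))  # (16, 13, 60, 90)
```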

## Loading from the original format

By default, [`AutoencoderKLCogVideoX`] should be loaded with [`~ModelMixin.from_pretrained`], but it can also be loaded from the original format using [`FromOriginalModelMixin.from_single_file`] as follows:

```py
from diffusers import AutoencoderKLCogVideoX

url = "THUDM/CogVideoX-2b"  # can also be a local file
model = AutoencoderKLCogVideoX.from_single_file(url)
```

## AutoencoderKLCogVideoX

[[autodoc]] AutoencoderKLCogVideoX
- decode
- encode
- all

## CogVideoXSafeConv3d

[[autodoc]] CogVideoXSafeConv3d

## CogVideoXCausalConv3d

[[autodoc]] CogVideoXCausalConv3d

## CogVideoXSpatialNorm3D

[[autodoc]] CogVideoXSpatialNorm3D

## CogVideoXResnetBlock3D

[[autodoc]] CogVideoXResnetBlock3D

## CogVideoXDownBlock3D

[[autodoc]] CogVideoXDownBlock3D

## CogVideoXMidBlock3D

[[autodoc]] CogVideoXMidBlock3D

## CogVideoXUpBlock3D

[[autodoc]] CogVideoXUpBlock3D

## CogVideoXEncoder3D

[[autodoc]] CogVideoXEncoder3D

## CogVideoXDecoder3D

[[autodoc]] CogVideoXDecoder3D
18 changes: 18 additions & 0 deletions docs/source/en/api/models/cogvideox_transformer3d.md
@@ -0,0 +1,18 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# CogVideoXTransformer3DModel

A Diffusion Transformer model for 3D data from [CogVideoX](https://github.com/THUDM/CogVideoX).

## CogVideoXTransformer3DModel

[[autodoc]] CogVideoXTransformer3DModel
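As a back-of-the-envelope sketch of the sequence length this transformer attends over: the video latents are patchified spatially and concatenated with the text embeddings. The numbers below are assumptions in the style of CogVideoX-2b (latents of shape `(13, 60, 90)` for a 49-frame 480x720 video, spatial patch size 2); the text length of 226 is the value this PR's commits converge on ("ensure tokenizer config correctly uses 226 as text length").

```python
# Sequence-length sketch for a patchified video DiT. The patch size
# and latent dimensions are assumptions, not read from a model config.

def num_video_tokens(latent_frames: int, latent_height: int,
                     latent_width: int, patch_size: int = 2) -> int:
    # each latent frame contributes (H / p) * (W / p) patch tokens
    return latent_frames * (latent_height // patch_size) * (latent_width // patch_size)

video_tokens = num_video_tokens(13, 60, 90)      # 13 * 30 * 45
text_tokens = 226                                # max text length used by CogVideoX
print(video_tokens, video_tokens + text_tokens)  # 17550 17776
```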
79 changes: 79 additions & 0 deletions docs/source/en/api/pipelines/cogvideox.md
@@ -0,0 +1,79 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

## TODO: The paper is still being written.
-->

# CogVideoX

[TODO]() from Tsinghua University & ZhipuAI.

The abstract from the paper is:

The paper is still being written.

<Tip>

Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

</Tip>

## Inference

Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency.

First, load the pipeline:

```python
import torch
from diffusers import CogVideoXPipeline

pipeline = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b", torch_dtype=torch.float16
).to("cuda")
```

Review comment (Member): Do we need to include this code block to demonstrate torch.compile, or is it to show inference time without torch.compile? If it's not necessary, I'm more in favor of just showing the below to keep it simpler.

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# create pipeline
pipeline = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.bfloat16).to("cuda")

# set to channels_last
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.vae.to(memory_format=torch.channels_last)

# compile
pipeline.transformer = torch.compile(pipeline.transformer)
pipeline.vae.decode = torch.compile(pipeline.vae.decode)

# inference
prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
video = pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=8)
```

Then change the memory layout of the pipeline's `transformer` and `vae` components to `torch.channels_last`:

```python
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.vae.to(memory_format=torch.channels_last)
```

Finally, compile the components and run inference:

```python
from diffusers.utils import export_to_video

pipeline.transformer = torch.compile(pipeline.transformer)
pipeline.vae.decode = torch.compile(pipeline.vae.decode)

# CogVideoX works very well with long, well-described prompts
prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
video = pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=8)
```

The [benchmark](TODO: link) results on an 80GB A100 machine are:

```
Without torch.compile(): Average inference time: TODO seconds.
With torch.compile(): Average inference time: TODO seconds.
```
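The TODO timings above can be filled in with a simple wall-clock average. A minimal sketch of such a measurement follows; the run count, the warmup (the first call after `torch.compile` triggers compilation, so it should not be timed), and the helper name are arbitrary choices, not part of this PR:

```python
import time

def average_inference_time(fn, runs: int = 3, warmup: int = 1) -> float:
    """Average wall-clock seconds per call to `fn`, after `warmup`
    untimed calls (e.g. to absorb torch.compile compilation time)."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs

# usage sketch, assuming `pipeline` and `prompt` from the snippets above:
# avg = average_inference_time(lambda: pipeline(prompt=prompt, num_inference_steps=50))
# print(f"Average inference time: {avg:.2f} seconds.")
```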

Review comment (Member): We can also include a tip section like we have in Flux:
https://huggingface.co/docs/diffusers/main/en/api/pipelines/flux

This way users are aware of the optimizations that are possible.

Review comment (Member): We should probably also mention that users can benefit from context-parallel caching.

## CogVideoXPipeline

[[autodoc]] CogVideoXPipeline
- all
- __call__

## CogVideoXPipelineOutput

[[autodoc]] pipelines.pipeline_cogvideo.pipeline_output.CogVideoXPipelineOutput