Add CogVideoX text-to-video generation model #9082
Changes from 95 commits
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# AutoencoderKLCogVideoX

The 3D variational autoencoder (VAE) model with KL loss used in CogVideoX.

## Loading from the original format

By default, [`AutoencoderKLCogVideoX`] should be loaded with [`~ModelMixin.from_pretrained`], but it can also be loaded from the original format using [`FromOriginalModelMixin.from_single_file`] as follows:

```py
from diffusers import AutoencoderKLCogVideoX

url = "THUDM/CogVideoX-2b"  # can also be a local file
model = AutoencoderKLCogVideoX.from_single_file(url)
```
## AutoencoderKLCogVideoX
[[autodoc]] AutoencoderKLCogVideoX
  - decode
  - encode
  - all

## CogVideoXSafeConv3d

[[autodoc]] CogVideoXSafeConv3d

## CogVideoXCausalConv3d

[[autodoc]] CogVideoXCausalConv3d
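The core idea behind a causal 3D convolution can be sketched in plain PyTorch. This is an illustrative stand-in, not the actual `CogVideoXCausalConv3d` implementation: all temporal padding goes on the "past" side of the frame axis, so each output frame depends only on the current and earlier input frames.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalConv3dSketch(nn.Module):
    """Hypothetical sketch of a causal temporal convolution over video."""

    def __init__(self, in_channels: int, out_channels: int, time_kernel: int = 3):
        super().__init__()
        # Pad (time_kernel - 1) frames *before* frame 0, none after,
        # so frame t never sees frames > t.
        self.time_pad = time_kernel - 1
        self.conv = nn.Conv3d(
            in_channels, out_channels,
            kernel_size=(time_kernel, 3, 3),
            padding=(0, 1, 1),  # spatial padding only; temporal pad is manual
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        # F.pad pairs run from the last dim inward: (W_l, W_r, H_l, H_r, T_l, T_r)
        x = F.pad(x, (0, 0, 0, 0, self.time_pad, 0))
        return self.conv(x)


conv = CausalConv3dSketch(4, 4)
x = torch.randn(1, 4, 5, 8, 8)
y = conv(x)  # same number of frames as the input
```

Because of the one-sided padding, perturbing the last input frame leaves all earlier output frames unchanged, which is what makes frame-by-frame (streaming) decoding possible.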

## CogVideoXSpatialNorm3D

[[autodoc]] CogVideoXSpatialNorm3D

## CogVideoXResnetBlock3D

[[autodoc]] CogVideoXResnetBlock3D

## CogVideoXDownBlock3D

[[autodoc]] CogVideoXDownBlock3D

## CogVideoXMidBlock3D

[[autodoc]] CogVideoXMidBlock3D

## CogVideoXUpBlock3D

[[autodoc]] CogVideoXUpBlock3D

## CogVideoXEncoder3D

[[autodoc]] CogVideoXEncoder3D

## CogVideoXDecoder3D

[[autodoc]] CogVideoXDecoder3D
# CogVideoXTransformer3DModel

A Diffusion Transformer model for 3D data from [CogVideoX](https://github.com/THUDM/CogVideoX).

## CogVideoXTransformer3DModel

[[autodoc]] CogVideoXTransformer3DModel
# CogVideoX

[TODO]() from Tsinghua University & ZhipuAI.

The abstract from the paper is:

*The paper is still being written.*

<Tip>

Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

</Tip>
### Inference

Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency.

First, load the pipeline:
```python
import torch
from diffusers import CogVideoXPipeline

pipeline = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b", torch_dtype=torch.float16
).to("cuda")
```

Then change the memory layout of the pipeline's `transformer` and `vae` components to `torch.channels_last`:

```python
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.vae.to(memory_format=torch.channels_last)
```
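As a quick standalone illustration (plain PyTorch, not CogVideoX-specific), `channels_last` only reorders how a tensor is laid out in memory; its shape and values are untouched, which is why the conversion above is safe to apply to existing weights:

```python
import torch

x = torch.randn(2, 3, 4, 4)           # default (contiguous / NCHW) layout
y = x.to(memory_format=torch.channels_last)  # NHWC-style strides

# Same shape and values; only the strides (memory order) differ.
same_values = torch.equal(x, y)
nhwc_layout = y.is_contiguous(memory_format=torch.channels_last)
```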

Finally, compile the components and run inference:

```python
from diffusers.utils import export_to_video

pipeline.transformer = torch.compile(pipeline.transformer)
pipeline.vae.decode = torch.compile(pipeline.vae.decode)

# CogVideoX works very well with long and well-described prompts
prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
video = pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=8)
```
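For intuition, `torch.compile` wraps a module without changing its outputs. Here is a tiny stand-in sketch (the `backend="eager"` debugging backend is used only so it runs anywhere without a C++ toolchain; for real speedups you would use the default inductor backend, as in the pipeline example above):

```python
import torch

# Stand-in module; the doc above compiles the pipeline's transformer and
# vae.decode in exactly the same way.
model = torch.nn.Linear(4, 4)
compiled = torch.compile(model, backend="eager")

x = torch.randn(2, 4)
out_eager = model(x)
out_compiled = compiled(x)
# Compilation preserves the numerics of the original module.
```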

The [benchmark](TODO: link) results on an 80GB A100 machine are:

```
Without torch.compile(): Average inference time: TODO seconds.
With torch.compile(): Average inference time: TODO seconds.
```

## CogVideoXPipeline

[[autodoc]] CogVideoXPipeline
  - all
  - __call__

## CogVideoXPipelineOutput

[[autodoc]] pipelines.pipeline_cogvideo.pipeline_output.CogVideoXPipelineOutput