<!--Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-->

# CogVideoX

<!-- TODO: update paper with ArXiv link when ready. -->

[CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf) from Tsinghua University & ZhipuAI.

The abstract from the paper is:

*We introduce CogVideoX, a large-scale diffusion transformer model designed for generating videos based on text prompts. To efficiently model video data, we propose to leverage a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions. To improve the text-video alignment, we propose an expert transformer with the expert adaptive LayerNorm to facilitate the deep fusion between the two modalities. By employing a progressive training technique, CogVideoX is adept at producing coherent, long-duration videos characterized by significant motion. In addition, we develop an effective text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method. It significantly helps enhance the performance of CogVideoX, improving both generation quality and semantic alignment. Results show that CogVideoX demonstrates state-of-the-art performance across multiple machine metrics and human evaluations. The model weights of CogVideoX-2B are publicly available at https://github.com/THUDM/CogVideo.*

<Tip>

Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

</Tip>

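For example, a different scheduler can be rebuilt from the current scheduler's config. This is a minimal sketch; it assumes `CogVideoXDDIMScheduler` is exported by `diffusers` alongside the pipeline:

```python
from diffusers import CogVideoXPipeline, CogVideoXDDIMScheduler

# Load the pipeline, then swap in a different scheduler built from the
# existing scheduler's config (CogVideoXDDIMScheduler is assumed here).
pipeline = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b")
pipeline.scheduler = CogVideoXDDIMScheduler.from_config(pipeline.scheduler.config)
```
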
This pipeline was contributed by [zRzRzRzRzRzRzR](https://github.com/zRzRzRzRzRzRzR). The original codebase can be found at [THUDM/CogVideo](https://github.com/THUDM/CogVideo). The original weights can be found under [hf.co/THUDM](https://huggingface.co/THUDM).

## Inference

Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency.

First, load the pipeline:

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipeline = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b").to("cuda")
```

Then change the memory layout of the pipeline's `transformer` and `vae` components to `torch.channels_last`:

```python
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.vae.to(memory_format=torch.channels_last)
```
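
In `channels_last` format, tensors are stored in NHWC rather than NCHW order, which can allow convolution-heavy modules such as the VAE to dispatch to faster kernels on recent GPUs; the outputs themselves are unchanged.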

Finally, compile the components and run inference:

```python
pipeline.transformer = torch.compile(pipeline.transformer)
pipeline.vae.decode = torch.compile(pipeline.vae.decode)

# CogVideoX works very well with long, well-described prompts
prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
    "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
    "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
    "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
    "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
    "atmosphere of this unique musical performance."
)
video = pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=8)
```
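
Note that the first call after `torch.compile` pays a one-time compilation cost and is noticeably slower; subsequent calls with the same input shapes reuse the compiled graph and run at the reduced latency.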

The [benchmark](TODO: link) results on an 80GB A100 machine are:

```
Without torch.compile(): Average inference time: TODO seconds.
With torch.compile(): Average inference time: TODO seconds.
```

## CogVideoXPipeline

[[autodoc]] CogVideoXPipeline
  - all
  - __call__

## CogVideoXPipelineOutput

[[autodoc]] pipelines.cogvideo.pipeline_cogvideox.CogVideoXPipelineOutput