Add CogVideoX text-to-video generation model #9082
Changes from 95 commits
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# AutoencoderKLCogVideoX

The 3D variational autoencoder (VAE) model with KL loss used in CogVideoX.

## Loading from the original format

By default, [`AutoencoderKLCogVideoX`] should be loaded with [`~ModelMixin.from_pretrained`], but it can also be loaded from the original format using [`FromOriginalModelMixin.from_single_file`] as follows:

```py
from diffusers import AutoencoderKLCogVideoX

url = "THUDM/CogVideoX-2b"  # can also be a local file
model = AutoencoderKLCogVideoX.from_single_file(url)
```
## AutoencoderKLCogVideoX
[[autodoc]] AutoencoderKLCogVideoX
  - decode
  - encode
  - all

## CogVideoXSafeConv3d

[[autodoc]] CogVideoXSafeConv3d

## CogVideoXCausalConv3d

[[autodoc]] CogVideoXCausalConv3d
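The core idea behind a causal 3D convolution can be sketched in plain PyTorch. This is an illustrative stand-in, not the actual `CogVideoXCausalConv3d` implementation: all temporal padding goes on the "past" side of the frame axis, so each output frame depends only on the current and earlier input frames.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalConv3dSketch(nn.Module):
    """Hypothetical sketch of a causal temporal convolution over video."""

    def __init__(self, in_channels: int, out_channels: int, time_kernel: int = 3):
        super().__init__()
        # Pad (time_kernel - 1) frames *before* frame 0, none after,
        # so frame t never sees frames > t.
        self.time_pad = time_kernel - 1
        self.conv = nn.Conv3d(
            in_channels, out_channels,
            kernel_size=(time_kernel, 3, 3),
            padding=(0, 1, 1),  # spatial padding only; temporal pad is manual
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        # F.pad pairs run from the last dim inward: (W_l, W_r, H_l, H_r, T_l, T_r)
        x = F.pad(x, (0, 0, 0, 0, self.time_pad, 0))
        return self.conv(x)


conv = CausalConv3dSketch(4, 4)
x = torch.randn(1, 4, 5, 8, 8)
y = conv(x)  # same number of frames as the input
```

Because of the one-sided padding, perturbing the last input frame leaves all earlier output frames unchanged, which is what makes frame-by-frame (streaming) decoding possible.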

## CogVideoXSpatialNorm3D

[[autodoc]] CogVideoXSpatialNorm3D

## CogVideoXResnetBlock3D

[[autodoc]] CogVideoXResnetBlock3D

## CogVideoXDownBlock3D

[[autodoc]] CogVideoXDownBlock3D

## CogVideoXMidBlock3D

[[autodoc]] CogVideoXMidBlock3D

## CogVideoXUpBlock3D

[[autodoc]] CogVideoXUpBlock3D

## CogVideoXEncoder3D

[[autodoc]] CogVideoXEncoder3D

## CogVideoXDecoder3D

[[autodoc]] CogVideoXDecoder3D
# CogVideoXTransformer3DModel

A Diffusion Transformer model for 3D data from [CogVideoX](https://github.com/THUDM/CogVideoX).

## CogVideoXTransformer3DModel

[[autodoc]] CogVideoXTransformer3DModel
# CogVideoX

[TODO]() from Tsinghua University & ZhipuAI.

The abstract from the paper is:

*The paper is still being written.*

<Tip>

Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

</Tip>
### Inference

Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency.

First, load the pipeline:
```python
import torch
from diffusers import CogVideoXPipeline

pipeline = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b", torch_dtype=torch.float16
).to("cuda")
```

Then change the memory layout of the pipeline's `transformer` and `vae` components to `torch.channels_last`:

```python
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.vae.to(memory_format=torch.channels_last)
```
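As a quick standalone illustration (plain PyTorch, not CogVideoX-specific), `channels_last` only reorders how a tensor is laid out in memory; its shape and values are untouched, which is why the conversion above is safe to apply to existing weights:

```python
import torch

x = torch.randn(2, 3, 4, 4)           # default (contiguous / NCHW) layout
y = x.to(memory_format=torch.channels_last)  # NHWC-style strides

# Same shape and values; only the strides (memory order) differ.
same_values = torch.equal(x, y)
nhwc_layout = y.is_contiguous(memory_format=torch.channels_last)
```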

Finally, compile the components and run inference:

```python
from diffusers.utils import export_to_video

pipeline.transformer = torch.compile(pipeline.transformer)
pipeline.vae.decode = torch.compile(pipeline.vae.decode)

# CogVideoX works very well with long and well-described prompts
prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
video = pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=8)
```
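For intuition, `torch.compile` wraps a module without changing its outputs. Here is a tiny stand-in sketch (the `backend="eager"` debugging backend is used only so it runs anywhere without a C++ toolchain; for real speedups you would use the default inductor backend, as in the pipeline example above):

```python
import torch

# Stand-in module; the doc above compiles the pipeline's transformer and
# vae.decode in exactly the same way.
model = torch.nn.Linear(4, 4)
compiled = torch.compile(model, backend="eager")

x = torch.randn(2, 4)
out_eager = model(x)
out_compiled = compiled(x)
# Compilation preserves the numerics of the original module.
```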

The [benchmark](TODO: link) results on an 80GB A100 machine are:

```
Without torch.compile(): Average inference time: TODO seconds.
With torch.compile(): Average inference time: TODO seconds.
```

## CogVideoXPipeline

[[autodoc]] CogVideoXPipeline
  - all
  - __call__

## CogVideoXPipelineOutput

[[autodoc]] pipelines.pipeline_cogvideo.pipeline_output.CogVideoXPipelineOutput