
Add SkyReels V2: Infinite-Length Film Generative Model #11518


Draft · wants to merge 187 commits into base: main

Conversation

@tolgacangoz (Contributor) commented May 7, 2025

Thanks for the opportunity to fix #11374!

Original Work

Original repo: https://github.com/SkyworkAI/SkyReels-V2
Paper: https://huggingface.co/papers/2504.13074

SkyReels V2's main contributions are summarized as follows:
• A comprehensive video captioner that understands shot language while capturing a general description of the video, which dramatically improves prompt adherence.
• Motion-specific preference optimization that enhances motion dynamics with a semi-automatic data collection pipeline.
• An effective diffusion-forcing adaptation that enables ultra-long video generation and story generation, providing a robust framework for extending temporal coherence and narrative depth.
• SkyCaptioner-V1 and the SkyReels-V2 series of models, including diffusion-forcing, text2video, image2video, camera-director, and elements2video models in various sizes (1.3B, 5B, 14B), are open-sourced.

[Figure: main_pipeline]

TODOs:
• FlowMatchUniPCMultistepScheduler: just copy-pasted from the original repo (a scheduler-swap sketch follows this list)
• SkyReelsV2Transformer3DModel: 90% WanTransformer3DModel
• SkyReelsV2DiffusionForcingPipeline
• SkyReelsV2DiffusionForcingImageToVideoPipeline: includes FLF2V.
• SkyReelsV2DiffusionForcingVideoToVideoPipeline: extends a given video.
• SkyReelsV2Pipeline
• SkyReelsV2ImageToVideoPipeline
• scripts/convert_skyreelsv2_to_diffusers.py
⬜ Did you make sure to update the documentation with your changes?
⬜ Did you write any new necessary tests?
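
As a quick sanity check for the scheduler item above, here is a minimal sketch (not part of the PR) that attaches the new scheduler to a loaded pipeline. It assumes `FlowMatchUniPCMultistepScheduler` is importable from this branch and reuses the standard `from_config` scheduler API; the class name and signature may still change before merge.

```python
# Minimal sketch; assumption: FlowMatchUniPCMultistepScheduler is exported by this branch.
# `from_config` is the standard diffusers scheduler constructor; the class name is taken
# from the TODO list above and may still change before merge.
import torch
from diffusers import FlowMatchUniPCMultistepScheduler, SkyReelsV2DiffusionForcingPipeline

pipe = SkyReelsV2DiffusionForcingPipeline.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers", torch_dtype=torch.bfloat16
)
# Rebuild the scheduler from the pipeline's existing scheduler config.
pipe.scheduler = FlowMatchUniPCMultistepScheduler.from_config(pipe.scheduler.config)
```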

T2V with Diffusion Forcing

Skywork/SkyReels-V2-DF-1.3B-540P

| Setting | Original repo | diffusers integration |
|---|---|---|
| seed 0, num_frames 97 | original_0_short.mp4 | diffusers_0_short.mp4 |
| seed 37, num_frames 97 | original_37_short.mp4 | diffusers_37_short.mp4 |
| seed 0, num_frames 257 | original_0_long.mp4 | diffusers_0_long.mp4 |
| seed 37, num_frames 257 | original_37_long.mp4 | diffusers_37_long.mp4 |
!pip install git+https://github.com/tolgacangoz/diffusers.git@skyreels-v2 ftfy -q

```python
import torch
from diffusers import AutoencoderKLWan, SkyReelsV2DiffusionForcingPipeline
from diffusers.utils import export_to_video

vae = AutoencoderKLWan.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    subfolder="vae",
    torch_dtype=torch.float32,
)
pipe = SkyReelsV2DiffusionForcingPipeline.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")
pipe.transformer.set_ar_attention(causal_block_size=5)

prompt = "A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window."

output = pipe(
    prompt=prompt,
    num_inference_steps=30,
    height=544,
    width=960,
    num_frames=97,
    ar_step=5,  # Controls asynchronous inference (0 for synchronous mode)
    generator=torch.Generator(device="cuda").manual_seed(0),
    overlap_history=None,  # Number of frames to overlap for smooth transitions in long videos; 17 for long videos
    addnoise_condition=20,  # Improves consistency in long video generation
).frames[0]
export_to_video(output, "video.mp4", fps=24, quality=8)
```

"""
You can set `ar_step=5` to enable asynchronous inference. When asynchronous inference,
`causal_block_size=5` is recommended while it is not supposed to be set for
synchronous generation. Asynchronous inference will take more steps to diffuse the
whole sequence which means it will be SLOWER than synchronous mode. In our
experiments, asynchronous inference may improve the instruction following and visual consistent performance.
"""

tolgacangoz and others added 7 commits May 7, 2025 21:53
…usion forcing

- Introduced the drafts of `SkyReelsV2TextToVideoPipeline`, `SkyReelsV2ImageToVideoPipeline`, `SkyReelsV2DiffusionForcingPipeline`, and `FlowUniPCMultistepScheduler`.
@ukaprch commented May 8, 2025

It's about time. Thanks.

tolgacangoz added 22 commits May 8, 2025 20:01
Replaces custom attention implementations with `SkyReelsV2AttnProcessor2_0` and the standard `Attention` module.
Updates `WanAttentionBlock` to use `FP32LayerNorm` and `FeedForward`.
Removes the `model_type` parameter, simplifying model architecture and attention block initialization.
Introduces new classes `SkyReelsV2ImageEmbedding` and `SkyReelsV2TimeTextImageEmbedding` for enhanced image and time-text processing. Refactors the `SkyReelsV2Transformer3DModel` to integrate these embeddings, updating the constructor parameters for better clarity and functionality. Removes unused classes and methods to streamline the codebase.
…ds and begin reorganizing the forward pass.
…hod, integrating rotary embeddings and improving attention handling. Removes the deprecated `rope_apply` function and streamlines the attention mechanism for better integration and clarity.
…ethod by updating parameter names for clarity, integrating attention masks, and improving the handling of encoder hidden states.
…ethod by enhancing the handling of time embeddings and encoder hidden states. Updates parameter names for clarity and integrates rotary embeddings, ensuring better compatibility with the model's architecture.
…ForcingVideoToVideoPipeline`, enhancing support for Video-to-Video (v2v) generation. Introduce video input handling, update latent preparation logic, and improve error handling for input parameters.
… the `image_encoder` and `image_processor` dependencies. Update the CPU offload sequence accordingly.
@tolgacangoz tolgacangoz marked this pull request as draft May 29, 2025 08:07
…latent preparation logic and condition handling. Update image input type to `Optional`, streamline video condition processing, and improve handling of `last_image` during latent generation.
…ration for long video generation. Introduce new parameters for video handling, overlap history, and causal block size. Update logic to accommodate both short and long video scenarios, ensuring compatibility and improved processing.
…ideoToVideoPipeline` to ensure proper noise scaling during latent generation.
…pport for `last_image` parameter and refining latent frame calculations. Update preprocessing logic.
…eoPipeline` by correcting variable names and reintroducing latent mean and standard deviation calculations. Update logic for frame preparation and sampling to ensure accurate video generation.
…latent handling by enforcing tensor input for video, updating frame preparation logic, and adjusting default frame count. Enhance preprocessing and postprocessing steps for better integration.
…ForcingImageToVideoPipeline` to ensure correct dimensionality for video conditions and latent conditions.
…VideoPipeline` to handle tensor dimensions more robustly, ensuring compatibility with both 3D and 4D video inputs.
…teration print statements for better debugging. Clean up unused code related to prefix video latents length calculation in `SkyReelsV2DiffusionForcingImageToVideoPipeline`.
Successfully merging this pull request may close these issues.

[Feature request] Integrate SkyReels-V2 support in diffusers