Make group offloading compatible with torch.compile() #11605

sayakpaul · 2025-05-23T10:57:04Z

What does this PR do?

On H100 with Wan 14B, we get:

Compile: False offloading: False
Latency: 4601.803 ms (median over 1 runs)

Compile: True offloading: False
Latency: 3766.335 ms (median over 1 runs)

Compile: False offloading: True
Latency: 4918.042 ms (median over 1 runs)

Compile: True offloading: True
Latency: 4109.500 ms (median over 1 runs)

On RTX 4090, we get:

Compile: False offloading: True
Latency: 13658.121 ms (median over 1 runs)

Compile: True offloading: True
Latency: 11583.754 ms (median over 1 runs)

Code:

Expand

from diffusers import AutoModel
import torch 
torch.set_grad_enabled(False)
from torch.utils import benchmark
import argparse

torch._dynamo.config.cache_size_limit = 10000

def get_input_dict(**device_dtype_kwargs):
    # height: 480
    # width: 832
    # num_frames: 81
    # max_sequence_length: 512
    hidden_states = torch.randn(1, 16, 21, 60, 104, **device_dtype_kwargs)
    encoder_hidden_states = torch.randn(1, 512, 4096, **device_dtype_kwargs)
    timestep = torch.tensor([1.0], **device_dtype_kwargs)

    return {"hidden_states": hidden_states, "encoder_hidden_states": encoder_hidden_states, "timestep": timestep}

def get_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--compile", action="store_true")
    parser.add_argument("--go", action="store_true")
    return parser.parse_args()

if __name__ == "__main__":
    args = get_parser()
    transformer = AutoModel.from_pretrained(
        "Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16
    )
    if not args.go:
        transformer.cuda()
    else:
        group_offload_kwargs={
            "onload_device": "cuda",
            "offload_device": "cpu",
            "offload_type": "block_level",
            "num_blocks_per_group": 1,
            "use_stream": True,
            "non_blocking": True,
        }
        transformer.enable_group_offload(**group_offload_kwargs)
    if args.compile:
        transformer.compile()
    
    input_kwargs = {"dtype": torch.bfloat16, "device": "cuda"} if not args.go else {"dtype": torch.bfloat16}
    input_dict = get_input_dict(**input_kwargs)

    for _ in range(4):  
        _ = transformer(**input_dict)

    latency_timer = benchmark.Timer(
        stmt="transformer(**input_dict)",
        setup="from __main__ import transformer, input_dict",
        num_threads=1,
        label="Go+Compilation inference latency",
    )

    latency_result = latency_timer.blocked_autorange(min_run_time=1)
    latency_ms = latency_result.median * 1e3
    print(f"Compile: {args.compile} offloading: {args.go}")
    print(f"Latency: {latency_ms:.3f} ms (median over {len(latency_result.times)} runs)")

As one would expect, using streams for overlapping compute with communication yields the best trade-off. Using record_stream=True gives additional speedups.

HuggingFaceDocBuilderDev · 2025-05-23T11:04:11Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

a-r-r-o-w

Looks good, thanks!

Probably reason for requiring the disable here: pytorch/pytorch#92804

wip: check if we can make go compile compat

7414605

sayakpaul requested review from DN6 and a-r-r-o-w May 23, 2025 10:57

sayakpaul added performance Anything related to performance improvements, profiling and benchmarking torch.compile labels May 23, 2025

a-r-r-o-w approved these changes May 27, 2025

View reviewed changes

Merge branch 'main' into go-compilation

49bc30f

sayakpaul merged commit 5f5d02f into main May 27, 2025
16 checks passed

sayakpaul deleted the go-compilation branch May 27, 2025 03:26

DN6 added the roadmap Add to current release roadmap label Jun 5, 2025

github-project-automation bot added this to Diffusers Roadmap 0.35 Jun 5, 2025

github-project-automation bot moved this to In Progress in Diffusers Roadmap 0.35 Jun 5, 2025

DN6 moved this from In Progress to Done in Diffusers Roadmap 0.35 Jun 5, 2025

sayakpaul mentioned this pull request Jun 6, 2025

[tests] add test for torch.compile + group offloading #11670

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Make group offloading compatible with torch.compile() #11605

Make group offloading compatible with torch.compile() #11605

Uh oh!

sayakpaul commented May 23, 2025 •

edited

Loading

Uh oh!

HuggingFaceDocBuilderDev commented May 23, 2025

Uh oh!

a-r-r-o-w left a comment

Uh oh!

Uh oh!

Uh oh!

Make group offloading compatible with torch.compile() #11605

Make group offloading compatible with torch.compile() #11605

Uh oh!

Conversation

sayakpaul commented May 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Uh oh!

HuggingFaceDocBuilderDev commented May 23, 2025

Uh oh!

a-r-r-o-w left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

sayakpaul commented May 23, 2025 •

edited

Loading