
Performance issue in hy3dgen/shapegen/models/denoisers GELU(nn.Module): 1.55x module speedup observed on RTX 3090 #236

Open · David-Dingle opened this issue May 5, 2025 · 0 comments


David-Dingle commented May 5, 2025

Sys env:

python=3.11.9
Hunyuan3D-2 : b8d6b65
OS: Ubuntu 22.04
GPU: NVIDIA RTX 3090
PyTorch: 2.6.0 (CUDA 12.4)

Description:

The GELU module's source code forces every non-contiguous tensor passed into it to make a contiguous copy before the activation is applied.

torch.nn.functional.gelu accepts both contiguous and non-contiguous tensors as input. Contiguous operands do shorten kernel execution time, but the .contiguous() copy itself is expensive, and its overhead is observed to outweigh the time the contiguous operands save.
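For reference, the module in question looks roughly like this (a sketch based on the description above, not a verbatim copy; the constructor details may differ from the actual source):

import torch.nn as nn

class GELU(nn.Module):
    def __init__(self, approximate: str = 'tanh'):
        super().__init__()
        self.approximate = approximate

    def forward(self, x):
        # Current behavior: copy non-contiguous inputs before the activation.
        # Proposed fix: drop the .contiguous() call, since F.gelu handles
        # non-contiguous tensors directly.
        x = x.contiguous()
        return nn.functional.gelu(x, approximate=self.approximate)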

Reproduce:

import time

import torch
from PIL import Image

from hy3dgen.rembg import BackgroundRemover
from hy3dgen.shapegen import Hunyuan3DDiTFlowMatchingPipeline

images = {
    "front": "assets/example_mv_images/1/front.png",
    "left": "assets/example_mv_images/1/left.png",
    "back": "assets/example_mv_images/1/back.png"
}

for key in images:
    image = Image.open(images[key])
    # Remove the background only for RGB images; the result carries
    # an alpha channel.
    if image.mode == 'RGB':
        rembg = BackgroundRemover()
        image = rembg(image)
    images[key] = image

pipeline = Hunyuan3DDiTFlowMatchingPipeline.from_pretrained(
    'tencent/Hunyuan3D-2mv',
    subfolder='hunyuan3d-dit-v2-mv',
    variant='fp16'
)

start_time = time.time()
mesh = pipeline(
    image=images,
    num_inference_steps=50,
    octree_resolution=380,
    num_chunks=20000,
    generator=torch.manual_seed(12345),
    output_type='trimesh'
)[0]
print(f"Inference took {time.time() - start_time:.2f} s")

During inference, data passes through this GELU module many times. Profiling the kernel execution time of the following scope:

x = x.contiguous()  # candidate for removal
nn.functional.gelu(x, approximate=self.approximate)

shows that removing the .contiguous() call shortens the total kernel execution time from 1148.51 ms to 740.17 ms, a 1.55x module speedup.
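As a standalone check, a small microbenchmark along these lines reproduces the effect on a CUDA device (a sketch; the tensor shape and iteration count are arbitrary, not the ones profiled above):

import torch
import torch.nn.functional as F

def bench_gelu(x, make_contiguous, iters=1000):
    # Time the GELU kernels with CUDA events; a warm-up pass avoids
    # measuring one-time setup cost.
    F.gelu(x.contiguous() if make_contiguous else x, approximate='tanh')
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        y = x.contiguous() if make_contiguous else x
        F.gelu(y, approximate='tanh')
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)  # milliseconds

# A non-contiguous view, e.g. as produced by a transpose upstream.
x = torch.randn(64, 512, 1024, device='cuda', dtype=torch.float16).transpose(1, 2)
print(f"with .contiguous():    {bench_gelu(x, True):.2f} ms")
print(f"without .contiguous(): {bench_gelu(x, False):.2f} ms")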
