Performance Issue of hy3dgen/shapegen/models/denoisers/GELU(nn.Module): 1.55x module speed-up observed on RTX 3090

Sys env:
python=3.11.9
Hunyuan3D-2: b8d6b65
OS: Ubuntu 22.04
GPU/CUDA: NVIDIA RTX 3090, PyTorch 2.6.0 with CUDA 12.4

Description:
The source code forces every non-contiguous tensor passed into GELU to make a contiguous copy before activation.
torch.nn.functional.gelu accepts both contiguous and non-contiguous tensors as input. Contiguous operands do shorten the activation kernel's execution time, but the .contiguous() call itself is expensive, and its overhead is observed to outweigh the time that contiguous operands save.
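To illustrate the claim, here is a minimal microbenchmark (a sketch with hypothetical tensor shapes, not taken from the issue; requires a CUDA GPU) comparing F.gelu on a non-contiguous tensor against the same call preceded by an explicit .contiguous() copy:

import torch
import torch.nn.functional as F

def time_cuda(fn, iters=100):
    # Warm up, then time with CUDA events so GPU time is measured accurately.
    for _ in range(10):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per call

x = torch.randn(4, 4096, 2048, device='cuda', dtype=torch.float16)
x_nc = x.transpose(1, 2)  # non-contiguous view of the same data

print("gelu on non-contiguous input:", time_cuda(lambda: F.gelu(x_nc)), "ms")
print("contiguous() copy + gelu:    ", time_cuda(lambda: F.gelu(x_nc.contiguous())), "ms")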
Reproduce:
import time

import torch
from PIL import Image

from hy3dgen.rembg import BackgroundRemover
from hy3dgen.shapegen import Hunyuan3DDiTFlowMatchingPipeline

# Multi-view input images for the Hunyuan3D-2mv pipeline.
images = {
    "front": "assets/example_mv_images/1/front.png",
    "left": "assets/example_mv_images/1/left.png",
    "back": "assets/example_mv_images/1/back.png"
}

for key in images:
    image = Image.open(images[key]).convert("RGBA")
    if image.mode == 'RGB':
        # Strip the background if the image has no alpha channel.
        rembg = BackgroundRemover()
        image = rembg(image)
    images[key] = image

pipeline = Hunyuan3DDiTFlowMatchingPipeline.from_pretrained(
    'tencent/Hunyuan3D-2mv',
    subfolder='hunyuan3d-dit-v2-mv',
    variant='fp16'
)

start_time = time.time()
mesh = pipeline(
    image=images,
    num_inference_steps=50,
    octree_resolution=380,
    num_chunks=20000,
    generator=torch.manual_seed(12345),
    output_type='trimesh'
)[0]
print(f"Inference took {time.time() - start_time:.2f} s")
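The issue does not state which profiling tool was used; one way to attribute CUDA time to the GELU and copy kernels is torch.profiler, sketched below around the same pipeline call:

from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    mesh = pipeline(
        image=images, num_inference_steps=50, octree_resolution=380,
        num_chunks=20000, generator=torch.manual_seed(12345), output_type='trimesh'
    )[0]

# Look for the GELU and elementwise-copy kernels in the CUDA time breakdown.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))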
Inference data passes through GELU many times during sampling.
The kernel execution time was profiled over the following scope:

    x = x.contiguous()  # pending removal
    nn.functional.gelu(x, approximate=self.approximate)

Removing the .contiguous() call shortens the kernel execution time from 1148.51 ms to 740.17 ms, roughly a 1.55x speed-up for this module.
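The proposed fix is simply to drop that copy. A minimal sketch of the resulting module, assuming a GELU wrapper along these lines (the actual class in hy3dgen/shapegen/models/denoisers may carry additional logic such as a projection layer or device-specific branches):

import torch.nn as nn

class GELU(nn.Module):
    # Sketch only: illustrates removing the explicit contiguous copy.
    def __init__(self, approximate='tanh'):
        super().__init__()
        self.approximate = approximate

    def forward(self, x):
        # x = x.contiguous()  # removed: the copy costs more time than it saves
        return nn.functional.gelu(x, approximate=self.approximate)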