Works like a charm. Thanks for this! #20
-
Hey, @pinballelectronica, glad you're getting some use out of the node. Please star the repo if you haven't already and are so inclined. In my testing (limited, but done with various resolutions and durations on a single 3090), it is the overall Pixel Load that matters. This is the data I have, with your data point thrown in: 720x480x200 = 69,120,000 pixels ≈ 13 GB of VRAM for latent space at under 10 minutes (I put it at 9.5). I assumed 16 steps, but if yours was not FastHunyuan I will need to recalibrate the graph. So I'd say you are most definitely faster than my comparable 3090 times with just Sage Attn (my 3090 would come in around 12.5 minutes by the looks of it), and if you aren't using TeaCache already you might want to give it a try. If you have, then your times surprise me a bit. Windows or Linux, if I may ask?
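For reference, a minimal Python sketch of the pixel-load arithmetic described above. The 69,120,000-pixel reference point and the ~13 GB figure come straight from the comment; the linear extrapolation to other resolutions and frame counts is an assumption, not something measured here.

```python
# Rough pixel-load estimate. Assumes latent-space VRAM scales roughly
# linearly with width * height * frames (an assumption, not a measurement).

def pixel_load(width: int, height: int, frames: int) -> int:
    """Total pixel load for a video generation job."""
    return width * height * frames

# Reference point from the comment above: 720x480x200 on a 3090 ~ 13 GB.
REF_LOAD = pixel_load(720, 480, 200)   # 69,120,000
REF_VRAM_GB = 13.0

def estimate_vram_gb(width: int, height: int, frames: int) -> float:
    """Linear extrapolation from the single data point -- a rough guide only."""
    return REF_VRAM_GB * pixel_load(width, height, frames) / REF_LOAD

if __name__ == "__main__":
    print(pixel_load(720, 480, 200))                   # 69120000
    print(round(estimate_vram_gb(960, 544, 129), 1))   # ~12.7 (extrapolated)
```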
-
Was thinking about filing an issue for this and then realized discussions were open. I just wanted to post to say thank you as well! This is a phenomenal piece of software. I have AMD GPUs from two different generations with different amounts of VRAM, and this (plus the much better swapping behavior for huge GGUF models) makes it possible to run large models that I never could before without resorting to two-step workflows of generating a latent, saving it, and then bringing it back into a separate workflow. I've switched over most of what I do now to workflows with these nodes, especially UnetLoaderGGUFAdvancedDisTorchMultiGPU. So, yeah, just wanted to say thank you!
-
Just extending a thanks for the work and the sweet collaboration with the gods of quantization. Runs beautifully with Sage Attn on 2x 4090s in various configs, with regular lora loaders in ComfyUI (loras created using Diffusion Pipe). Very fast: 720x480x200 frames will complete in under 10 minutes. Trying to understand whether I would get better output by matching the video dimensions to the latent sizes that were bucketed prior to training. Thanks!
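As an illustration of the bucket-matching idea raised in the question above, here is a minimal sketch of snapping a requested resolution to the nearest training bucket. The bucket list is a placeholder, not the actual Diffusion Pipe training config, and whether matching buckets actually improves output is exactly the open question.

```python
# Hypothetical helper: snap a requested resolution to the nearest training
# bucket. BUCKETS below is a placeholder -- substitute the sizes that were
# actually used when the LoRA was trained.

BUCKETS = [(512, 512), (720, 480), (960, 544)]  # placeholder values

def nearest_bucket(width: int, height: int) -> tuple[int, int]:
    """Return the bucket closest to the request by aspect ratio and area."""
    def distance(bucket: tuple[int, int]) -> float:
        bw, bh = bucket
        aspect_diff = abs(bw / bh - width / height)
        area_diff = abs(bw * bh - width * height) / (width * height)
        return aspect_diff + area_diff
    return min(BUCKETS, key=distance)

print(nearest_bucket(704, 480))  # -> (720, 480) with the placeholder buckets
```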