Description
Describe the bug
The developer of the training code, diffusion-pipe, helped me debug this. The issue on that repository contains all the relevant information I have so far. His summary:
So plain PyTorch GPU communication ops work. But deepspeed.initialize() is always failing when it does its version of cross-GPU communication. Myself and other users have this working, but it fails specifically with multiple 5090s, and you are probably the only person who has tried that setup.
I would raise an issue with Deepspeed. I don't think I've done anything wrong in the application code, and it is likely an internal Deepspeed problem. Without being able to reproduce the error myself, there's not much more I can do.
Full issue: tdrussell/diffusion-pipe#235 (comment)
To Reproduce
Steps to reproduce the behavior:
Call deepspeed.initialize() in a multi-GPU run on 2 x 5090 GPUs (in my case, through diffusion-pipe training).
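A minimal standalone sketch of that step, in case it helps narrow things down outside of diffusion-pipe. It has not been confirmed to trigger the failure. The file name `repro.py`, the helper `make_ds_config`, the tiny `Linear` model, and the config values are all placeholders chosen for illustration, not taken from the diffusion-pipe code. It first runs a plain `torch.distributed.all_reduce` (the kind of op reported as working) and then calls `deepspeed.initialize()`, so the output shows which of the two stages is reached.

```python
import os


def make_ds_config(micro_batch_size: int = 1) -> dict:
    """Small DeepSpeed config for the sketch (illustrative values only)."""
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    }


def main() -> None:
    # Imported here so the file can be read or imported without a GPU stack.
    import torch
    import deepspeed

    # Sets up the torch.distributed process group (NCCL on CUDA by default).
    deepspeed.init_distributed()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Stage 1: plain PyTorch cross-GPU communication.
    t = torch.ones(1, device=f"cuda:{local_rank}")
    torch.distributed.all_reduce(t)
    print(f"rank {local_rank}: all_reduce ok, value={t.item()}")

    # Stage 2: the call that fails in the reported setup.
    model = torch.nn.Linear(8, 8)
    engine, _, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=make_ds_config(),
    )
    print(f"rank {local_rank}: deepspeed.initialize() returned")


if __name__ == "__main__":
    if "LOCAL_RANK" in os.environ:
        main()
    else:
        # Only run the GPU path when started by a distributed launcher.
        print("Launch with: deepspeed --num_gpus=2 repro.py")
```

Launched with `deepspeed --num_gpus=2 repro.py`, each rank should print the stage 1 line. If the stage 2 line never appears, the problem is isolated to `deepspeed.initialize()` itself rather than to the application code.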
System info (please complete the following information):
- OS: Ubuntu 24.04
- GPU count and types: 1 machine with 2 x 5090s
- Python version: 3.12
- Any other relevant info about your setup: Latest NVIDIA drivers, PyTorch nightly, etc.