[BUG] - Multiple 5090s failing on deepspeed.initialize() #7261

**Describe the bug**

The developer of the training code, diffusion-pipe, helped me debug this. The issue on that repository has all the relevant information I have so far. His summary:

> So plain PyTorch GPU communication ops work. But deepspeed.initialize() is always failing when it does its version of cross-GPU communication. Myself and other users have this working, but it fails specifically with multiple 5090s, and you are probably the only person who has tried that setup.

> I would raise an issue with Deepspeed. I don't think I've done anything wrong in the application code, and it is likely an internal Deepspeed problem. Without being able to reproduce the error myself, there's not much more I can do.

Full issue: https://github.com/tdrussell/diffusion-pipe/issues/235#issuecomment-2831270369


**To Reproduce**
Steps to reproduce the behavior:

Run deepspeed.initialize() on a single machine with 2 x 5090 GPUs (in my case, through diffusion-pipe training).
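
A standalone reproduction outside diffusion-pipe might look roughly like the sketch below. It is not taken from diffusion-pipe: the file name `repro.py`, the helper `make_ds_config`, the tiny `Linear` model, and the config values are all illustrative assumptions. It first runs a plain PyTorch all-reduce (the kind of op reported to work in the linked issue) and then calls deepspeed.initialize(), so the output should show which of the two steps fails. A plain SGD optimizer is passed in to avoid pulling DeepSpeed's fused optimizer kernels into the picture.

```python
# repro.py (hypothetical file name); launch with: deepspeed --num_gpus=2 repro.py
import os


def make_ds_config(micro_batch_size=1):
    """Build a deliberately tiny DeepSpeed config; the values are illustrative."""
    return {"train_micro_batch_size_per_gpu": micro_batch_size}


def main():
    # Imported lazily so make_ds_config() can be inspected without GPUs installed.
    import torch
    import torch.distributed as dist
    import deepspeed

    # Set up the process group using the accelerator's default backend.
    deepspeed.init_distributed()
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)

    # Step 1: plain PyTorch collective across both GPUs.
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}: all_reduce ok, value={t.item()}")

    # Step 2: the call reported to fail with 2 x 5090.
    model = torch.nn.Linear(8, 8)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    engine, _, _, _ = deepspeed.initialize(
        model=model,
        optimizer=optimizer,
        config=make_ds_config(),
    )
    print(f"rank {dist.get_rank()}: deepspeed.initialize ok")


# Only run when started by a distributed launcher that sets LOCAL_RANK.
if __name__ == "__main__" and "LOCAL_RANK" in os.environ:
    main()
```

If step 1 prints on both ranks and step 2 never does, that would narrow the failure to DeepSpeed's own initialization path rather than NCCL or the driver.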


**System info (please complete the following information):**
 - OS: Ubuntu 24.04
 - GPU count and types: 1 machine with 2 x 5090s
 - Python version: 3.12
 - Any other relevant info about your setup: latest NVIDIA drivers, PyTorch nightly, etc.
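
Since "latest" and "nightly" change from day to day, exact versions may help whoever tries to reproduce this. A small script along the following lines could collect them. This is only a sketch: `collect_env` is a hypothetical helper name, and it reports whatever happens to be installed rather than the values from my machine.

```python
def collect_env():
    """Return version details relevant to multi-GPU communication."""
    info = {}
    try:
        import torch
    except ImportError:
        info["torch"] = None
        return info

    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # CUDA version PyTorch was built against
    if torch.cuda.is_available():
        count = torch.cuda.device_count()
        info["gpus"] = [torch.cuda.get_device_name(i) for i in range(count)]
        info["capability"] = torch.cuda.get_device_capability(0)
        info["nccl"] = torch.cuda.nccl.version()

    try:
        import deepspeed
        info["deepspeed"] = deepspeed.__version__
    except Exception:
        # DeepSpeed missing or failing at import time; record that it is unavailable.
        info["deepspeed"] = None
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

The output of DeepSpeed's `ds_report` command and of `python -m torch.utils.collect_env` covers similar ground in more detail.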

