
[BUG] - Multiple 5090s failing on deepspeed.initialize() #7261

Open
@Oruli

Description


Describe the bug

The developer of the training code Diffusion-pipe helped me debug this; the issue on that repository has all the relevant information I currently have. His summary:

So plain PyTorch GPU communication ops work. But deepspeed.initialize() is always failing when it does its version of cross-GPU communication. Myself and other users have this working, but it fails specifically with multiple 5090s, and you are probably the only person who has tried that setup.

I would raise an issue with Deepspeed. I don't think I've done anything wrong in the application code, and it is likely an internal Deepspeed problem. Without being able to reproduce the error myself, there's not much more I can do.

Full issue: tdrussell/diffusion-pipe#235 (comment)
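For context, a plain PyTorch collective check along these lines runs fine on this machine. This is a rough sketch of what was tested, not code taken from diffusion-pipe; the script name, ops, and launch command are my assumptions.

```python
# nccl_check.py -- plain PyTorch NCCL sanity check across both GPUs.
# Launch with: torchrun --nproc_per_node=2 nccl_check.py
import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Simple all-reduce across both 5090s; completes if raw NCCL comms work.
    x = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: all_reduce result = {x.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```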

To Reproduce
Steps to reproduce the behavior:

Run deepspeed.initialize() on a machine with 2 x RTX 5090 GPUs (a minimal sketch is shown below)
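A standalone reproducer would look roughly like the following. The toy model, config values, and launch command are placeholders of my own, not the actual diffusion-pipe code; the failure reportedly happens inside the deepspeed.initialize() call itself.

```python
# repro.py -- minimal DeepSpeed engine setup on 2 GPUs.
# Launch with: deepspeed --num_gpus=2 repro.py
import torch
import deepspeed

model = torch.nn.Linear(16, 16)
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 1},
}

# The failure occurs here, during DeepSpeed's own cross-GPU
# communication at engine setup, with 2 x RTX 5090.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
print(f"rank {engine.global_rank}: deepspeed.initialize() succeeded")
```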

System info (please complete the following information):

  • OS: Ubuntu 24.04
  • GPU count and types: 1 machine with 2 x RTX 5090s
  • Python version: 3.12
  • Any other relevant info about your setup: latest NVIDIA drivers, PyTorch nightly, etc.

Labels: bug (Something isn't working), training