Multi gpu setup got stuck with accelerate but torchrun works #3568

Open
@Jason3900

System Info

- `Accelerate` version: 1.6.0
- Platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.39
- `accelerate` bash location: /root/miniconda3/envs/torch_env/bin/accelerate
- Python version: 3.10.16
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.7.0+cu128 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch SDAA available: False
- PyTorch MUSA available: False
- System RAM: 2015.36 GB
- GPU type: NVIDIA H100 80GB HBM3
- `Accelerate` default config:
	Not found

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

I used a simple test script taken from other issues:

import torch
import socket

print("============", socket.gethostname(), "===========")

num = torch.cuda.device_count()
infos = [torch.cuda.get_device_properties(i) for i in range(num)]
print(infos)

The launch command that fails is:
accelerate launch --num_processes=2 run.py

At first it failed because it tried to connect to the IPv6 address of another, nonexistent host:

[c10d] The IPv6 network addresses of (job-b1a4dd18-fb07-4177-9b83-6065e29665ac-master-0, 23456) cannot be retrieved (gai error: -2 - Name or service not known).

I found that hostname in the environment, replaced all related variables (MASTER_ADDR, PET_MASTER_ADDR, HOSTNAME) with my current hostname, and used ping to verify that the host I set is reachable.
After that, the launch simply hangs without printing anything to stdout. I assume it would eventually hit a timeout error.
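For anyone trying to reproduce the environment check described above, here is a minimal sketch that lists the rendezvous-related variables and tests whether the host values resolve. The helper names (`collect_rendezvous_env`, `resolves`, `report`) are hypothetical, and the name filter is only an assumption based on the three variables mentioned in this report; other variables may matter too.

```python
import os
import socket


def collect_rendezvous_env(environ):
    """Return the variables that look rendezvous-related.

    Assumption: the filter is derived from the names seen in this report
    (MASTER_ADDR, PET_MASTER_ADDR, HOSTNAME), plus MASTER_PORT.
    """
    keep = {}
    for name, value in environ.items():
        if name == "HOSTNAME" or name.endswith(("MASTER_ADDR", "MASTER_PORT")):
            keep[name] = value
    return keep


def resolves(host):
    """Return True if the host name can be resolved to any address."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def report(environ=os.environ):
    """Print each collected variable and, for host values, whether it resolves."""
    for name, value in sorted(collect_rendezvous_env(environ).items()):
        if name.endswith("MASTER_PORT"):
            print(f"{name}={value}")
        else:
            print(f"{name}={value} (resolves: {resolves(value)})")
```

Calling `report()` in the same shell that runs `accelerate launch` shows which values the launcher would inherit. A name that does not resolve here would be consistent with the `gai error: -2` message above, but a name that does resolve does not rule out other causes of the hang.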

But with torchrun, the script runs successfully. The command is:
torchrun --nproc-per-node 2 --nnodes 1 run.py

Expected behavior

At the very least, it should provide some information to help me debug. I'd also like to understand the difference between torchrun and accelerate.
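In case it helps with gathering that information, below is a rough sketch of how the failing command could be rerun with more verbose logging and a hard timeout, so the hang ends with whatever output was produced. `NCCL_DEBUG`, `TORCH_DISTRIBUTED_DEBUG`, and `TORCH_CPP_LOG_LEVEL` are existing PyTorch/NCCL environment variables; the helper names (`debug_env`, `run_with_timeout`) and the 120-second limit are illustrative assumptions, and there is no guarantee the extra logs reveal the cause.

```python
import os
import subprocess


def debug_env(base=None):
    """Copy the environment and switch on verbose distributed logging."""
    env = dict(os.environ if base is None else base)
    env.update(
        {
            "NCCL_DEBUG": "INFO",
            "TORCH_DISTRIBUTED_DEBUG": "DETAIL",
            "TORCH_CPP_LOG_LEVEL": "INFO",
        }
    )
    return env


def run_with_timeout(cmd, env, timeout_s):
    """Run cmd; return its exit code, or None if it exceeded timeout_s.

    On timeout, subprocess.run kills the direct child only; worker
    processes it spawned may need to be cleaned up separately.
    """
    try:
        return subprocess.run(cmd, env=env, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None


# Example usage with the command from this report (not executed here):
# code = run_with_timeout(
#     ["accelerate", "launch", "--num_processes=2", "run.py"],
#     debug_env(),
#     timeout_s=120,
# )
# print("timed out" if code is None else f"exit code {code}")
```

Running the `torchrun` command through the same helper gives two logs produced under identical settings, which may make the difference between the two launchers easier to spot.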
