Description
System Info
- `Accelerate` version: 1.6.0
- Platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.39
- `accelerate` bash location: /root/miniconda3/envs/torch_env/bin/accelerate
- Python version: 3.10.16
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.7.0+cu128 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch SDAA available: False
- PyTorch MUSA available: False
- System RAM: 2015.36 GB
- GPU type: NVIDIA H100 80GB HBM3
- `Accelerate` default config:
Not found
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the `examples/` folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- My own task or dataset (give details below)
Reproduction
I used a simple test script taken from other issues:
```python
import torch
import socket

print("============", socket.gethostname(), "===========")
num = torch.cuda.device_count()
infos = [torch.cuda.get_device_properties(i) for i in range(num)]
print(infos)
```
The launch command that failed is:

```shell
accelerate launch --num_processes=2 run.py
```
It first failed because it tried to connect to the IPv6 address of a non-existent host:

```
[c10d] The IPv6 network addresses of (job-b1a4dd18-fb07-4177-9b83-6065e29665ac-master-0, 23456) cannot be retrieved (gai error: -2 - Name or service not known).
```
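For reference, the resolution failure above can be reproduced outside the launcher with a short Python check (the hostname below is the stale one from my environment; substitute whatever `MASTER_ADDR` is set to):

```python
import socket

# Stale rendezvous hostname from my environment, for illustration only.
addr = "job-b1a4dd18-fb07-4177-9b83-6065e29665ac-master-0"
try:
    socket.getaddrinfo(addr, 23456)
    print("resolved")
except socket.gaierror as e:
    # EAI_NONAME (-2) is "Name or service not known", matching the c10d error.
    print(e.errno, e.strerror)
```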
I found the stale hostname in the environment, replaced all related variables with my current hostname, and used ping to verify that the host I set is reachable. (The related variables are MASTER_ADDR, PET_MASTER_ADDR, and HOSTNAME.)
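Concretely, the override I applied looked roughly like this (a sketch; these are the variable names I found set to the stale `job-...-master-0` name in my environment):

```shell
# Point the rendezvous-related variables at the current, reachable host.
export MASTER_ADDR="$(hostname)"
export PET_MASTER_ADDR="$(hostname)"
export HOSTNAME="$(hostname)"
# (I then verified reachability with: ping -c 1 "$MASTER_ADDR")
```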
After that, the launch simply hung without printing anything to stdout; I expect it will eventually hit a timeout error.
But with torchrun, the same script runs successfully. The command is:

```shell
torchrun --nproc-per-node 2 --nnodes 1 run.py
```
Expected behavior
At a minimum, `accelerate launch` should print some diagnostic information to help debugging, and I'm wondering what torchrun and accelerate do differently when choosing the rendezvous address.