torch distributed training with multi gpus errors in GRPOtrainer #3451

jinhonglu · 2025-05-15T12:30:16Z

Reproduction

python -m torch.distributed.launch --nproc_per_node=1 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --use_env --master_port=12345 train.py train.training_arguments.output_dir=outputs/test_grpo
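
A quick sanity check for a launch command like the one above is to print how many processes were actually started, since `--nproc_per_node` controls the number of processes per node. The following is a minimal sketch, assuming the launcher exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` as environment variables (which `torch.distributed.launch --use_env` and `torchrun` do). The helper name `describe_launch` is hypothetical and not part of `train.py`.

```python
import os


def describe_launch(env=None):
    """Return the rank layout exported by the launcher.

    Falls back to single-process defaults when a variable is missing,
    e.g. when the script is started with plain `python train.py`.
    """
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }


if __name__ == "__main__":
    info = describe_launch()
    print(f"rank {info['rank']} (local {info['local_rank']}) of {info['world_size']}")
```

Calling this at the top of the training script on every process shows whether the run is really multi-process: a `world_size` of 1 means only one worker was started.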

I am fine-tuning the Qwen2-0.5B model with GRPO. Training launches successfully on 1 GPU, but it fails when I launch it on multiple GPUs.

state_dict = transformers.Qwen2ForCausalLM.from_pretrained(
        pretrained_model_name).state_dict()

The error indicates that the linear layer's mat2 argument must be a 2-D matrix, but a 1-D tensor was passed:

    return F.linear(input, self.weight, self.bias)
RuntimeError: mat2 must be a matrix, got 1-D tensor
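
Because `F.linear` receives `self.weight` as its second argument, this error suggests that some linear layer's weight is not 2-D at the time of the forward pass. One possibility, which is an assumption and not confirmed in this thread, is that a sharding wrapper has replaced or flattened the parameter on some ranks. The sketch below lists such layers so the failing one can be identified. It is duck-typed: it treats any module exposing `in_features` and `out_features` as a linear layer, and the helper name `find_bad_linear_weights` is hypothetical.

```python
def find_bad_linear_weights(named_modules):
    """Return (name, shape) for linear-like modules whose weight is not 2-D.

    `named_modules` is an iterable of (name, module) pairs, for example
    the result of `model.named_modules()` on a torch.nn.Module.
    """
    bad = []
    for name, module in named_modules:
        # torch.nn.Linear exposes in_features / out_features; norm layers,
        # whose weights are legitimately 1-D, do not.
        if hasattr(module, "in_features") and hasattr(module, "out_features"):
            shape = tuple(module.weight.shape)
            if len(shape) != 2:
                bad.append((name, shape))
    return bad
```

A possible use is to call `find_bad_linear_weights(model.named_modules())` on each rank just before the first forward pass and print the result. An empty list means every linear weight has the expected 2-D shape on that rank.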

I have checked that the inputs to the model are the same in both the 1-GPU and multi-GPU runs. What could be causing this error?

I also hit a CUDA kernel assertion:

/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:110: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `probability tensor contains either `inf`, `nan` or element < 0` failed.
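
This assertion is raised when sampling from a probability tensor (for example in `torch.multinomial`) that contains a NaN, an infinity, or a negative entry. To see which of the three conditions is present, the values can be inspected on the host before sampling. The following is a minimal sketch in plain Python; the helper name `diagnose_probs` is hypothetical, and it assumes the probabilities have been copied to a flat sequence of floats, for example with `probs.flatten().tolist()`.

```python
import math


def diagnose_probs(values):
    """Count entries that would trip the sampling assertion.

    Each value is counted once: NaN first, then +/-inf, then finite negatives.
    """
    counts = {"nan": 0, "inf": 0, "negative": 0}
    for v in values:
        if math.isnan(v):
            counts["nan"] += 1
        elif math.isinf(v):
            counts["inf"] += 1
        elif v < 0:
            counts["negative"] += 1
    return counts
```

If the NaN count is non-zero, the problem is likely upstream of sampling (in the logits or the weights that produced them) and not in the sampling call itself.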

Some previous issues say that removing device_map="auto" resolves the problem, but that does not help in my case.

System Info

Python 3.9
torch 2.6.0
transformers 4.51.3
trl 0.17.0

H100 80GB

Checklist

I have checked that my issue isn't already filed (see open issues)
I have included my system information
Any code provided is minimal, complete, and reproducible (more on MREs)
Any code provided is properly formatted in code blocks, (no screenshot, more on code blocks)
Any traceback provided is complete


xiaohaochen0308 · 2025-05-15T13:03:42Z

I have the same problem. Could we discuss it?

jinhonglu · 2025-05-15T13:25:31Z

I have the same problem. Could we discuss it?

I suspect it may be caused by accumulation. Are you using accumulation?

xiaohaochen0308 · 2025-05-15T13:26:52Z

19856014791 Could we talk on WeChat?


shirinyamani · 2025-05-15T20:59:42Z

@jinhonglu @xiaohaochen0308
Please use English so that the larger community can benefit!

github-actions bot added the labels 🏋 GRPO (Related to GRPO) and 🐛 bug (Something isn't working) on May 15, 2025
