[BUG] Training #7319

**Describe the bug**
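Out-of-memory (OOM) error while training the Qwen3-32B model with LoRA (PEFT) on 2 nodes with 8 H100 GPUs each, using DeepSpeed ZeRO stage 3. The failing allocation happens in the forward pass, inside the LoRA linear layer of the Qwen3 MLP: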

[rank10]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank10]:     return self._call_impl(*args, **kwargs)
[rank10]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank10]:     return inner()
[rank10]:            ^^^^^^^
[rank10]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1790, in inner
[rank10]:     result = forward_call(*args, **kwargs)
[rank10]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]:   File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen3/modeling_qwen3.py", line 94, in forward
[rank10]:     down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
[rank10]:                                            ^^^^^^^^^^^^^^^^^
[rank10]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank10]:     return self._call_impl(*args, **kwargs)
[rank10]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank10]:     return inner()
[rank10]:            ^^^^^^^
[rank10]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1790, in inner
[rank10]:     result = forward_call(*args, **kwargs)
[rank10]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]:   File "/usr/local/lib/python3.12/dist-packages/peft/tuners/lora/layer.py", line 621, in forward
[rank10]:     result = result + lora_B(lora_A(dropout(x))) * scaling
[rank10]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank10]:     return self._call_impl(*args, **kwargs)
[rank10]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank10]:     return inner()
[rank10]:            ^^^^^^^
[rank10]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1790, in inner
[rank10]:     result = forward_call(*args, **kwargs)
[rank10]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/linear.py", line 125, in forward
[rank10]:     return F.linear(input, self.weight, self.bias)
[rank10]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]:   File "/usr/local/lib/python3.12/dist-packages/deepspeed/runtime/zero/linear.py", line 116, in zero3_linear_wrap
[rank10]:     return LinearFunctionForZeroStage3.apply(input, weight)
[rank10]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]:   File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 575, in apply
[rank10]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank10]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]:   File "/usr/local/lib/python3.12/dist-packages/torch/amp/autocast_mode.py", line 465, in decorate_fwd
[rank10]:     return fwd(*args, **kwargs)
[rank10]:            ^^^^^^^^^^^^^^^^^^^^
[rank10]:   File "/usr/local/lib/python3.12/dist-packages/deepspeed/runtime/zero/linear.py", line 64, in forward
[rank10]:     output = input.matmul(weight.t())
[rank10]:              ^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 118.00 MiB. GPU 2 has a total capacity of 79.10 GiB of which 67.88 MiB is free. Process 3033429 has 79.00 GiB memory in use. Of the allocated memory 75.18 GiB is allocated by PyTorch, and 2.69 GiB is reserved by PyTorch but unallocated.
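
To make the numbers in the error message easier to read, here is a small sketch that only does arithmetic on the figures reported above. The helper names (`parse_oom`, `summarize`) are hypothetical and not part of PyTorch or DeepSpeed. The interpretation of the leftover memory as non-PyTorch usage (for example CUDA context or communication buffers) is an assumption, not something the log confirms.

```python
import re

# Figures copied verbatim from the rank10 OOM message above.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 118.00 MiB. "
    "GPU 2 has a total capacity of 79.10 GiB of which 67.88 MiB is free. "
    "Process 3033429 has 79.00 GiB memory in use. "
    "Of the allocated memory 75.18 GiB is allocated by PyTorch, "
    "and 2.69 GiB is reserved by PyTorch but unallocated."
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}

# One pattern per figure in the message; each captures (value, unit).
PATTERNS = {
    "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
    "capacity": r"total capacity of ([\d.]+) (MiB|GiB)",
    "free": r"of which ([\d.]+) (MiB|GiB) is free",
    "process_in_use": r"has ([\d.]+) (MiB|GiB) memory in use",
    "allocated": r"([\d.]+) (MiB|GiB) is allocated by PyTorch",
    "reserved_unallocated": r"([\d.]+) (MiB|GiB) is reserved by PyTorch",
}


def parse_oom(message):
    """Extract every memory figure from the message, converted to MiB."""
    stats = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find '{name}' in the message")
        value, unit = match.groups()
        stats[name] = float(value) * UNIT_TO_MIB[unit]
    return stats


def summarize(stats):
    """Derive a few quantities that help when reading the OOM report."""
    return {
        # How far the request exceeds what the driver reports as free.
        "shortfall_mib": stats["requested"] - stats["free"],
        # Memory held by the process but not tracked by PyTorch's allocator
        # (assumed here to be CUDA context, communication buffers, etc.).
        "non_pytorch_mib": stats["process_in_use"]
        - stats["allocated"]
        - stats["reserved_unallocated"],
        # True when PyTorch's cached-but-unused memory is larger than the request.
        "reserved_exceeds_request": stats["reserved_unallocated"] > stats["requested"],
    }


stats = parse_oom(OOM_MESSAGE)
breakdown = summarize(stats)
for key, value in breakdown.items():
    print(key, value)
```

On these figures the request exceeds free memory by about 50 MiB, while PyTorch holds roughly 2.7 GiB that is reserved but unallocated. That could point to allocator fragmentation rather than a hard capacity limit, although the log alone does not prove it.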

