Description
Describe the bug
OOM while training the Qwen3-32B model on 2 nodes with 8 H100 GPUs each, using DeepSpeed ZeRO stage 3.
[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank10]: return self._call_impl(*args, **kwargs)
[rank10]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank10]: return inner()
[rank10]: ^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1790, in inner
[rank10]: result = forward_call(*args, **kwargs)
[rank10]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen3/modeling_qwen3.py", line 94, in forward
[rank10]: down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
[rank10]: ^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank10]: return self._call_impl(*args, **kwargs)
[rank10]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank10]: return inner()
[rank10]: ^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1790, in inner
[rank10]: result = forward_call(*args, **kwargs)
[rank10]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/peft/tuners/lora/layer.py", line 621, in forward
[rank10]: result = result + lora_B(lora_A(dropout(x))) * scaling
[rank10]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank10]: return self._call_impl(*args, **kwargs)
[rank10]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank10]: return inner()
[rank10]: ^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1790, in inner
[rank10]: result = forward_call(*args, **kwargs)
[rank10]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/linear.py", line 125, in forward
[rank10]: return F.linear(input, self.weight, self.bias)
[rank10]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/deepspeed/runtime/zero/linear.py", line 116, in zero3_linear_wrap
[rank10]: return LinearFunctionForZeroStage3.apply(input, weight)
[rank10]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 575, in apply
[rank10]: return super().apply(*args, **kwargs) # type: ignore[misc]
[rank10]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/amp/autocast_mode.py", line 465, in decorate_fwd
[rank10]: return fwd(*args, **kwargs)
[rank10]: ^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/deepspeed/runtime/zero/linear.py", line 64, in forward
[rank10]: output = input.matmul(weight.t())
[rank10]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 118.00 MiB. GPU 2 has a total capacity of 79.10 GiB of which 67.88 MiB is free. Process 3033429 has 79.00 GiB memory in use. Of the allocated memory 75.18 GiB is allocated by PyTorch, and 2.69 GiB is reserved by PyTorch but unallocated.
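
For reference, here is a rough breakdown of the numbers in the final error line. This is only a sketch: the message is copied verbatim from the traceback above, while the helper names (`parse_oom`, `to_gib`) and the derived quantities are my own illustration and are not part of PyTorch or DeepSpeed.

```python
import re

# Final line of the traceback above, copied verbatim.
OOM_MESSAGE = (
    "torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 118.00 MiB. "
    "GPU 2 has a total capacity of 79.10 GiB of which 67.88 MiB is free. "
    "Process 3033429 has 79.00 GiB memory in use. "
    "Of the allocated memory 75.18 GiB is allocated by PyTorch, "
    "and 2.69 GiB is reserved by PyTorch but unallocated."
)

UNIT_TO_GIB = {"MiB": 1 / 1024, "GiB": 1.0}


def to_gib(value, unit):
    """Convert a (number, unit) pair from the message into GiB."""
    return float(value) * UNIT_TO_GIB[unit]


def parse_oom(message):
    """Hypothetical helper: pull the memory figures out of the OOM text."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"total capacity of ([\d.]+) (MiB|GiB)",
        "free": r"of which ([\d.]+) (MiB|GiB) is free",
        "process": r"has ([\d.]+) (MiB|GiB) memory in use",
        "allocated": r"([\d.]+) (MiB|GiB) is allocated by PyTorch",
        "reserved_unallocated": r"([\d.]+) (MiB|GiB) is reserved by PyTorch",
    }
    stats = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find '{name}' in the message")
        stats[name] = to_gib(match.group(1), match.group(2))
    return stats


stats = parse_oom(OOM_MESSAGE)

# Memory held by the PyTorch caching allocator (in use + cached but idle).
pytorch_reserved = stats["allocated"] + stats["reserved_unallocated"]
# Whatever the process holds outside the caching allocator.
non_pytorch = stats["process"] - pytorch_reserved

for name, gib in stats.items():
    print(f"{name:>22}: {gib:8.3f} GiB")
print(f"{'pytorch_reserved':>22}: {pytorch_reserved:8.3f} GiB")
print(f"{'non_pytorch':>22}: {non_pytorch:8.3f} GiB")
print("request exceeds free memory:", stats["requested"] > stats["free"])
```

Read this way, PyTorch holds about 77.87 GiB of the 79.00 GiB used by the process (75.18 GiB live plus 2.69 GiB cached but unallocated), leaving roughly 1.13 GiB outside the caching allocator. The failed request is only 118 MiB, but just 67.88 MiB of device memory is free, so the GPU is essentially full when the LoRA linear layer inside the MLP block runs its forward matmul.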