[BUG] Training #7319

Open
@sravan500

Describe the bug
Out-of-memory (OOM) error while training the Qwen3-32B model on 2 nodes with 8 H100 GPUs, using DeepSpeed ZeRO stage 3.

[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank10]: return self._call_impl(*args, **kwargs)
[rank10]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank10]: return inner()
[rank10]: ^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1790, in inner
[rank10]: result = forward_call(*args, **kwargs)
[rank10]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen3/modeling_qwen3.py", line 94, in forward
[rank10]: down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
[rank10]: ^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank10]: return self._call_impl(*args, **kwargs)
[rank10]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank10]: return inner()
[rank10]: ^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1790, in inner
[rank10]: result = forward_call(*args, **kwargs)
[rank10]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/peft/tuners/lora/layer.py", line 621, in forward
[rank10]: result = result + lora_B(lora_A(dropout(x))) * scaling
[rank10]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank10]: return self._call_impl(*args, **kwargs)
[rank10]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank10]: return inner()
[rank10]: ^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1790, in inner
[rank10]: result = forward_call(*args, **kwargs)
[rank10]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/linear.py", line 125, in forward
[rank10]: return F.linear(input, self.weight, self.bias)
[rank10]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/deepspeed/runtime/zero/linear.py", line 116, in zero3_linear_wrap
[rank10]: return LinearFunctionForZeroStage3.apply(input, weight)
[rank10]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 575, in apply
[rank10]: return super().apply(*args, **kwargs) # type: ignore[misc]
[rank10]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/amp/autocast_mode.py", line 465, in decorate_fwd
[rank10]: return fwd(*args, **kwargs)
[rank10]: ^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/deepspeed/runtime/zero/linear.py", line 64, in forward
[rank10]: output = input.matmul(weight.t())
[rank10]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 118.00 MiB. GPU 2 has a total capacity of 79.10 GiB of which 67.88 MiB is free. Process 3033429 has 79.00 GiB memory in use. Of the allocated memory 75.18 GiB is allocated by PyTorch, and 2.69 GiB is reserved by PyTorch but unallocated.
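
To make the numbers in the final error line easier to compare, here is a minimal Python sketch that parses the message and converts every figure to MiB. The helper name `parse_oom` and its regex patterns are illustrative assumptions, not part of PyTorch or DeepSpeed; the message text is copied from the traceback above.

```python
import re

# OOM message copied verbatim from the last line of the traceback above.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 118.00 MiB. "
    "GPU 2 has a total capacity of 79.10 GiB of which 67.88 MiB is free. "
    "Process 3033429 has 79.00 GiB memory in use. "
    "Of the allocated memory 75.18 GiB is allocated by PyTorch, "
    "and 2.69 GiB is reserved by PyTorch but unallocated."
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}

# Hypothetical patterns written for this message format only.
PATTERNS = {
    "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
    "capacity": r"total capacity of ([\d.]+) (MiB|GiB)",
    "free": r"of which ([\d.]+) (MiB|GiB) is free",
    "allocated": r"([\d.]+) (MiB|GiB) is allocated by PyTorch",
    "reserved_unallocated": r"([\d.]+) (MiB|GiB) is reserved by PyTorch",
}


def parse_oom(message: str) -> dict:
    """Return each memory figure from the OOM message, in MiB."""
    stats = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find '{key}' in message")
        value, unit = match.groups()
        stats[key] = float(value) * UNIT_TO_MIB[unit]
    return stats


if __name__ == "__main__":
    stats = parse_oom(OOM_MESSAGE)
    for key, mib in stats.items():
        print(f"{key:>22}: {mib:10.2f} MiB")
    # Share of the GPU's capacity that PyTorch has actually allocated.
    print(f"allocated / capacity: {stats['allocated'] / stats['capacity']:.1%}")
```

On this message the request (118.00 MiB) is larger than the free memory (67.88 MiB) but smaller than the 2.69 GiB that PyTorch has reserved without allocating. That may point to some fragmentation on top of the roughly 95% of capacity that is genuinely allocated, though the traceback alone does not confirm it.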

Labels: bug, training
