[dtensor] fix simplefsdp mixed-precision training bugs #154975


Closed · wants to merge 1 commit

Conversation

@ruisizhang123 (Contributor) commented on Jun 3, 2025

This is a follow-up to the previous dtensor redistribute PR (#150740), which enabled SimpleFSDP's mixed-precision training.

In the most recent TorchTitan integration (pytorch/torchtitan#1250), we found discrepancies between SimpleFSDP's `fully_shard` and `replicate` modes when mixed-precision training (MPT) is enabled. After debugging, I found the problem in dtensor redistribute: `local_tensor` is extracted again from the original `input`, so the dtensor used for communication keeps its original precision instead of being cast to `forward_dtype`.
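To illustrate the failure mode, here is a minimal sketch (hypothetical helper name, not the actual DTensor redistribute internals): the tensor handed to the collective must be the already-cast local shard; pulling `local_tensor` out of the original `input` again discards the cast.

```python
import torch


def cast_local_for_comm(local_tensor: torch.Tensor,
                        forward_dtype: torch.dtype) -> torch.Tensor:
    """Hypothetical sketch: return the shard that should be communicated."""
    # Cast the local shard to the reduced precision used for the collective.
    casted = local_tensor.to(forward_dtype)

    # Bug pattern: re-extracting the local tensor from the original dtensor
    # `input` at this point would discard the cast, so the collective would
    # run in the original precision instead of `forward_dtype`.

    # Correct pattern: keep using the casted tensor for communication.
    return casted


if __name__ == "__main__":
    shard = torch.randn(4, 4, dtype=torch.float32)
    assert cast_local_for_comm(shard, torch.bfloat16).dtype == torch.bfloat16
```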

This PR fixes this issue and corrects previously added test cases.

After fixing the bug, the loss curves of the `fully_shard` and `replicate` modes match perfectly.

![loss](https://github.com/user-attachments/assets/a8faddae-a476-48c0-a411-3fe04d2233bd)

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

@pytorch-bot (bot) commented on Jun 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154975

Note: Links to docs will display an error until the docs builds have been completed.

⏳ 1 Pending, 1 Unrelated Failure

As of commit c172c07 with merge base a7e496a:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot added the `oncall: distributed` label on Jun 3, 2025
@tianyu-l (Contributor) left a comment


thanks for the fix!

@tianyu-l added the `release notes: distributed (dtensor)` and `ciflow/trunk` labels on Jun 3, 2025
@tianyu-l (Contributor) commented on Jun 3, 2025

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


qingyi-yan pushed a commit to qingyi-yan/pytorch that referenced this pull request on Jun 3, 2025

iupaikov-amd pushed a commit to ROCm/pytorch that referenced this pull request on Jun 4, 2025

angelayi pushed a commit to angelayi/pytorch that referenced this pull request on Jun 5, 2025
Labels: ciflow/trunk, Merged, oncall: distributed, open source, release notes: distributed (dtensor)