
Distributed example on C++ API (LibTorch) #809


Merged: 14 commits into pytorch:master on Feb 19, 2021

Conversation

@soumyadipghosh (Contributor) commented Aug 1, 2020

This PR adds an example of distributed training using MPI on the C++ frontend, similar to DistributedDataParallel in Python. This topic was raised in the forums here. Right now, the code is CPU only.

Please let me know if this PR can be a worthwhile contribution.

cc @yf225 @pietern
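
For context, the heart of the example is averaging gradients across ranks between loss.backward() and optimizer.step(). A minimal sketch of that step with raw MPI (the function and variable names are illustrative, not the PR's actual code; gradients are assumed to be contiguous float tensors):

```cpp
#include <mpi.h>
#include <torch/torch.h>

// Average gradients across all MPI ranks, to be called after
// loss.backward() and before optimizer.step().
void average_gradients(torch::nn::Module& model, int world_size) {
  for (auto& param : model.parameters()) {
    auto grad = param.mutable_grad();
    // Sum each gradient buffer across all ranks in place...
    MPI_Allreduce(MPI_IN_PLACE, grad.data_ptr<float>(),
                  static_cast<int>(grad.numel()), MPI_FLOAT, MPI_SUM,
                  MPI_COMM_WORLD);
    // ...then divide by the number of ranks to get the average.
    grad /= world_size;
  }
}
```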

@yf225 (Contributor) commented Aug 2, 2020

@glaringlee @mrshenli Curious whether this would be a good addition? Thanks!

@glaringlee

@yf225 This looks good to me. I saw there is only one check (Run Examples) in this repo; is that by design?
@mrshenli Can you check the logic here? It seems fine to me.

@mrshenli (Contributor) left a comment


@glaringlee @yf225 The distributed gradient averaging part looks OK to me. How do we test this in the example repo? And how do we make sure future changes in PyTorch won't break this example?

cc @agolynski for review as well

@glaringlee commented Aug 3, 2020

@yf225 @mrshenli
I see there is a CI-like test in this repo which runs all the examples; any change that fails execution will be detected by that test. For silent failures (logic changes that don't stop execution), can we add asserts within the example so it breaks if any BC-breaking (logical) change is made?
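
For instance, a hypothetical end-of-training check (not the PR's actual code, and assuming `loss` holds the loss of the last mini-batch) could look like:

```cpp
// Hypothetical sanity check after the training loop: fail loudly if the
// final loss has not dropped below a threshold, so that a silent logic
// regression aborts the CI run instead of passing unnoticed.
float final_loss = loss.item<float>();
TORCH_CHECK(final_loss < 0.5f, "Final loss too high: ", final_loss);
```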

@mrshenli (Contributor) commented Aug 3, 2020

Do we need to add this new test to some .sh file, or will the CI automatically detect and include all new tests?

@glaringlee commented Aug 3, 2020

I think we should add dist-mnist after this line: define a dist-mnist() function and do proper cleanup:
https://github.com/pytorch/examples/blob/master/run_python_examples.sh#L189
cc @yf225 @mrshenli @soumyadipghosh

@soumyadipghosh (Contributor, Author)

Also, for running tests, an MPI installation is necessary. How will that be added to the CI pipeline?

@glaringlee

I think the CI is totally controlled by run_python_examples.sh. MPI can be installed when installing dependencies.
cc @seemethere

@soumyadipghosh (Contributor, Author)

cc @yf225 @glaringlee @mrshenli Can someone help me with the next steps? I am not sure how to add the dependencies and the tests.

@glaringlee

@malfet Are you familiar with this repo? Should we update run_python_examples.sh (add this test in) so that this test runs on each code update? cc @mrshenli

@soumyadipghosh (Contributor, Author)

Can someone please help me with the next steps? I think run_python_examples.sh is only relevant for the Python examples. It would be better to create something like run_cpp_examples.sh to test the C++ examples, including this one.
cc @yf225 @glaringlee @mrshenli @malfet @seemethere @soumith

@osalpekar (Member) left a comment


Thanks for giving an update on your work in the PyTorch forums! Adding some reviews inline.

@facebook-github-bot (Contributor)

Hi @soumyadipghosh!

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file.

In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@soumyadipghosh
Copy link
Contributor Author

@osalpekar @mrshenli Since we haven't heard back about the ProcessGroup C++ APIs (see comments above), I created a separate file that involves NCCL for GPU communication that is somewhat based on the NCCL Example 2 here. I am not sure how to combine the CPU and GPU versions into one file because the NCCL routines will lead to errors when run on CPUs. Any ideas on the next steps?
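
For reference, the bootstrap pattern in that NCCL example (one GPU per MPI rank: broadcast a ncclUniqueId over MPI, then run collectives on CUDA buffers) looks roughly like the sketch below; the gradient tensor is a stand-in and all error checking is omitted:

```cpp
#include <cuda_runtime.h>
#include <mpi.h>
#include <nccl.h>
#include <torch/torch.h>

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Rank 0 creates the unique id; every rank receives it over MPI.
  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

  cudaSetDevice(rank);  // one GPU per rank
  ncclComm_t comm;
  ncclCommInitRank(&comm, size, id, rank);
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Stand-in for a gradient tensor living on this rank's GPU.
  auto grad = torch::ones({1024}, torch::kCUDA);

  // In-place sum across ranks, then divide to average.
  ncclAllReduce(grad.data_ptr<float>(), grad.data_ptr<float>(), grad.numel(),
                ncclFloat, ncclSum, comm, stream);
  cudaStreamSynchronize(stream);
  grad /= size;

  ncclCommDestroy(comm);
  MPI_Finalize();
  return 0;
}
```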

@osalpekar (Member)

I added a comment above about the ProcessGroup C++ APIs; I think these should significantly simplify some of the NCCL initialization boilerplate that's here. For the sake of this example, adding support for multiple ProcessGroups might be a bit too much, since these examples should demonstrate to users how to use PyTorch Distributed training efficiently. We can stick to MPI for this example and include a comment that NCCL or Gloo can be used as alternatives. I think we should be in good shape to merge this once we move to the MPI ProcessGroup.

We also have an ongoing discussion about native DDP C++ support (pytorch/pytorch#48959), so please feel free to contribute ideas there as well!
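
With the MPI ProcessGroup, the per-parameter averaging reduces to something like the sketch below (based on the c10d C++ API around the time of this PR; header paths and signatures may differ across PyTorch versions). The process group would be created once at startup with c10d::ProcessGroupMPI::createProcessGroupMPI() and passed in:

```cpp
#include <c10d/ProcessGroupMPI.hpp>
#include <torch/torch.h>

// Sketch: gradient averaging through the c10d MPI ProcessGroup instead of
// raw MPI calls; the function name and structure are illustrative.
void average_gradients(torch::nn::Module& model,
                       const c10::intrusive_ptr<c10d::ProcessGroupMPI>& pg) {
  for (auto& param : model.parameters()) {
    // allreduce takes a vector of tensors and returns an async Work handle.
    std::vector<at::Tensor> grads = {param.mutable_grad()};
    auto work = pg->allreduce(grads);
    work->wait();  // block until the collective completes
    param.mutable_grad() /= pg->getSize();
  }
}
```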

@soumyadipghosh (Contributor, Author)

@osalpekar I agree that we can just stick to MPI for this example, so I removed the NCCL code for now. Regarding the move to ProcessGroupMPI, I made some initial changes to the code based on the c10d example here, but it doesn't compile. I have a lot of questions, which I put as comments in the code; can you please clarify them?

@soumyadipghosh changed the title from "Distributed example on C++" to "Distributed example on C++ API (LibTorch)" on Jan 19, 2021
@osalpekar (Member) left a comment


Looks great - thanks for your work on this example @soumyadipghosh!

@soumyadipghosh (Contributor, Author) commented Jan 29, 2021

@osalpekar Thanks for all your help and thanks to @lasagnaphil for the patches! I have verified the build and everything works great! Ready for merging :)

@soumyadipghosh (Contributor, Author)

@osalpekar So how does this get merged?

@osalpekar merged commit 2e8b5c5 into pytorch:master on Feb 19, 2021