Distributed example on C++ API (LibTorch) #809
Conversation
@glaringlee @mrshenli Curious whether this would be a good addition? Thanks!
@glaringlee @yf225 The distributed gradient averaging part looks OK to me. How do we test this in the example repo? And how do we make sure future changes in PyTorch won't break this example?
cc @agolynski for review as well
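For context, the gradient averaging under review boils down to an all-reduce over each parameter's gradient followed by a division by the world size. A minimal sketch with raw MPI is below; the function name and structure are illustrative, not necessarily the exact code in this PR:

```cpp
// Sketch: average gradients across ranks with raw MPI after loss.backward().
// Assumes MPI_Init has already been called and the model lives on the CPU
// with float32 parameters.
#include <mpi.h>
#include <torch/torch.h>

void average_gradients(torch::nn::Module& model) {
  int world_size = 1;
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);
  for (auto& param : model.parameters()) {
    auto grad = param.grad();
    if (!grad.defined()) {
      continue;  // parameter did not receive a gradient this step
    }
    // Sum the gradient buffer in place across all ranks, then average.
    MPI_Allreduce(MPI_IN_PLACE, grad.data_ptr<float>(),
                  static_cast<int>(grad.numel()), MPI_FLOAT, MPI_SUM,
                  MPI_COMM_WORLD);
    grad.div_(world_size);
  }
}
```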
@yf225 @mrshenli Do we need to add this new test to some CI script?
I think we should add dist-mnist after this line: define a dist-mnist() function and do proper cleanup.
Also, an MPI installation is necessary for running the tests. How will that be added to the CI pipeline?
I think the CI is totally controlled by run_python_examples.sh. MPI can be installed when installing dependencies. |
cc @yf225 @glaringlee @mrshenli Can someone help me with the next steps? I am not sure how to add the dependencies and the tests.
Can someone please help me with the next steps? I think run_python_examples.sh is only relevant for the Python examples. It would be better to create something like run_cpp_examples.sh to test the C++ examples, including this one.
Thanks for giving an update on your work in the PyTorch forums! Adding some reviews inline.
Hi @soumyadipghosh! Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
@osalpekar @mrshenli Since we haven't heard back about the ProcessGroup C++ APIs (see comments above), I created a separate file that uses NCCL for GPU communication, loosely based on NCCL Example 2 here. I am not sure how to combine the CPU and GPU versions into one file because the NCCL routines will fail when run on CPUs. Any ideas on the next steps?
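For reference, the MPI-bootstrapped NCCL initialization being discussed usually looks roughly like the sketch below (one GPU per MPI rank is assumed; this is illustrative, not the exact code from the separate file):

```cpp
// Sketch: bootstrap NCCL with MPI, one GPU per rank. Illustrative only;
// error checking of the MPI/NCCL/CUDA calls is omitted for brevity.
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, world_size = 1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);

  cudaSetDevice(rank);  // assumes ranks map 1:1 to local GPUs

  // Rank 0 creates the NCCL unique id and broadcasts it to all ranks.
  ncclUniqueId id;
  if (rank == 0) {
    ncclGetUniqueId(&id);
  }
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

  ncclComm_t comm;
  ncclCommInitRank(&comm, world_size, id, rank);

  // ... GPU gradient buffers can now be averaged with ncclAllReduce(...) ...

  ncclCommDestroy(comm);
  MPI_Finalize();
  return 0;
}
```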
I added a comment above about the ProcessGroup C++ APIs - I think these should significantly simplify some of the NCCL initialization boilerplate that's here. For the sake of this example, adding support for multiple ProcessGroups might be a bit too much, since these examples should show users how to use PyTorch distributed training efficiently - we can stick to MPI for this example and include a comment noting that NCCL or Gloo can be used as alternatives. I think we should be in good shape to merge this once we move to the MPI ProcessGroup. We also have an ongoing discussion about native DDP C++ support (pytorch/pytorch#48959), so please feel free to contribute ideas there as well!
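To illustrate what the ProcessGroupMPI path can look like, here is a rough sketch of gradient averaging through c10d; the header path and call signatures follow the c10d sources as I understand them, so treat this as a sketch rather than the final code in this PR:

```cpp
// Sketch: gradient averaging through c10d::ProcessGroupMPI instead of raw MPI.
#include <torch/torch.h>
#include <c10d/ProcessGroupMPI.hpp>

void average_gradients(torch::nn::Module& model, c10d::ProcessGroupMPI& pg) {
  for (auto& param : model.parameters()) {
    if (!param.grad().defined()) {
      continue;
    }
    std::vector<torch::Tensor> grads = {param.grad()};
    // Sum gradients across all ranks, then divide by the world size.
    auto work = pg.allreduce(grads);
    work->wait();
    param.grad().div_(pg.getSize());
  }
}

// The process group itself is created collectively over MPI_COMM_WORLD, e.g.
// auto pg = c10d::ProcessGroupMPI::createProcessGroupMPI();
```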
@osalpekar I agree that we can stick to MPI for this example, so I removed the NCCL code for now. Regarding the move to ProcessGroupMPI, I made some initial changes to the code based on the c10d example here, but it doesn't compile. I have a lot of questions, which I put as comments in the code - can you please clarify them?
Link c10d static library and fix some compilation errors
Looks great - thanks for your work on this example @soumyadipghosh!
@osalpekar Thanks for all your help and thanks to @lasagnaphil for the patches! I have verified the build and everything works great! Ready for merging :)
@osalpekar So how does this get merged?
Distributed example on C++ API (LibTorch)
This PR adds an example of distributed training using MPI on the C++ frontend, similar to DistributedDataParallel in Python. This topic was raised in the forums here. Right now, the code is CPU-only.
Please let me know if this PR can be a worthwhile contribution.
cc @yf225 @pietern