
Distributed example on C++ API (LibTorch) #809


Merged: 14 commits into pytorch:master on Feb 19, 2021

Conversation

@soumyadipghosh (Contributor) commented Aug 1, 2020

This PR adds an example of distributed training using MPI on the C++ frontend, similar to DistributedDataParallel in Python. This topic was raised in the forums here. Right now, the code is CPU only.

Please let me know if this PR can be a worthwhile contribution.

cc @yf225 @pietern
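
For context, the heart of the example is averaging gradients across ranks between loss.backward() and optimizer.step(). A minimal sketch of that step with raw MPI (the function and variable names are illustrative, not the PR's actual code; gradients are assumed to be contiguous float tensors):

```cpp
#include <mpi.h>
#include <torch/torch.h>

// Average gradients across all MPI ranks, to be called after
// loss.backward() and before optimizer.step().
void average_gradients(torch::nn::Module& model, int world_size) {
  for (auto& param : model.parameters()) {
    auto grad = param.mutable_grad();
    // Sum each gradient buffer across all ranks in place...
    MPI_Allreduce(MPI_IN_PLACE, grad.data_ptr<float>(),
                  static_cast<int>(grad.numel()), MPI_FLOAT, MPI_SUM,
                  MPI_COMM_WORLD);
    // ...then divide by the number of ranks to get the average.
    grad /= world_size;
  }
}
```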

@yf225 (Contributor) commented Aug 2, 2020

@glaringlee @mrshenli Curious whether this would be a good addition? Thanks!

@glaringlee

@yf225 This looks good to me. I saw there is only one check (Run Examples) in this repo; is that by design?
@mrshenli Can you check the logic here? It seems fine to me.

@mrshenli (Contributor) left a comment


@glaringlee @yf225 The distributed gradient averaging part looks OK to me. How do we test this in the example repo? And how do we make sure future changes in PyTorch won't break this example?

cc @agolynski for review as well

@glaringlee commented Aug 3, 2020

@yf225 @mrshenli
I see there is a CI-like test in this repo which runs all the examples; any change that fails execution will be detected by that test. For silent failures (logic changes that don't stop execution), can we add asserts within the example so it breaks if any BC-breaking (logical) change is made?
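
For instance, a hypothetical end-of-training check (not the PR's actual code, and assuming `loss` holds the loss of the last mini-batch) could look like:

```cpp
// Hypothetical sanity check after the training loop: fail loudly if the
// final loss has not dropped below a threshold, so that a silent logic
// regression aborts the CI run instead of passing unnoticed.
float final_loss = loss.item<float>();
TORCH_CHECK(final_loss < 0.5f, "Final loss too high: ", final_loss);
```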

@mrshenli (Contributor) commented Aug 3, 2020

Do we need to add this new test to some .sh file, or will the CI automatically detect and include all new tests?

@glaringlee commented Aug 3, 2020

I think we should add dist-mnist after this line: define a dist-mnist() function and do proper cleanup:
https://github.com/pytorch/examples/blob/master/run_python_examples.sh#L189
cc @yf225 @mrshenli @soumyadipghosh

@soumyadipghosh (Contributor, Author)

Also, for running tests, an MPI installation is necessary. How will that be added to the CI pipeline?

@glaringlee

I think the CI is totally controlled by run_python_examples.sh. MPI can be installed when installing dependencies.
cc @seemethere

@soumyadipghosh (Contributor, Author)

cc @yf225 @glaringlee @mrshenli Can someone help me with the next steps? I am not sure how to add the dependencies and the tests.

@glaringlee

@malfet Are you familiar with this repo? Should we update run_python_examples.sh (add this test in) so that this test runs on each code update? cc @mrshenli

@soumyadipghosh (Contributor, Author)

Can someone please help me with the next steps? I think run_python_examples.sh is only relevant for the Python examples. It would be better to create something like run_cpp_examples.sh to test the C++ examples, including this one.
cc @yf225 @glaringlee @mrshenli @malfet @seemethere @soumith

@osalpekar (Member) left a comment


Thanks for giving an update on your work in the PyTorch forums! Adding some reviews inline.

@facebook-github-bot (Contributor)

Hi @soumyadipghosh!

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file.

In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@soumyadipghosh
Copy link
Contributor Author

@osalpekar @mrshenli Since we haven't heard back about the ProcessGroup C++ APIs (see comments above), I created a separate file that involves NCCL for GPU communication that is somewhat based on the NCCL Example 2 here. I am not sure how to combine the CPU and GPU versions into one file because the NCCL routines will lead to errors when run on CPUs. Any ideas on the next steps?
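
For reference, the bootstrap pattern in that NCCL example (one GPU per MPI rank: broadcast a ncclUniqueId over MPI, then run collectives on CUDA buffers) looks roughly like the sketch below; the gradient tensor is a stand-in and all error checking is omitted:

```cpp
#include <cuda_runtime.h>
#include <mpi.h>
#include <nccl.h>
#include <torch/torch.h>

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Rank 0 creates the unique id; every rank receives it over MPI.
  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

  cudaSetDevice(rank);  // one GPU per rank
  ncclComm_t comm;
  ncclCommInitRank(&comm, size, id, rank);
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Stand-in for a gradient tensor living on this rank's GPU.
  auto grad = torch::ones({1024}, torch::kCUDA);

  // In-place sum across ranks, then divide to average.
  ncclAllReduce(grad.data_ptr<float>(), grad.data_ptr<float>(), grad.numel(),
                ncclFloat, ncclSum, comm, stream);
  cudaStreamSynchronize(stream);
  grad /= size;

  ncclCommDestroy(comm);
  MPI_Finalize();
  return 0;
}
```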

@osalpekar (Member)

I added a comment above about the ProcessGroup C++ APIs; I think these should significantly simplify some of the NCCL initialization boilerplate that's here. For the sake of this example, adding support for multiple ProcessGroups might be a bit too much, since these examples should demonstrate to users how to use PyTorch Distributed training efficiently. We can stick to MPI for this example and include a comment that NCCL or Gloo can be used as alternatives. I think we should be in good shape to merge this once we move to the MPI ProcessGroup.

We also have an ongoing discussion about native DDP C++ support (pytorch/pytorch#48959), so please feel free to contribute ideas there as well!
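
With the MPI ProcessGroup, the per-parameter averaging reduces to something like the sketch below (based on the c10d C++ API around the time of this PR; header paths and signatures may differ across PyTorch versions). The process group would be created once at startup with c10d::ProcessGroupMPI::createProcessGroupMPI() and passed in:

```cpp
#include <c10d/ProcessGroupMPI.hpp>
#include <torch/torch.h>

// Sketch: gradient averaging through the c10d MPI ProcessGroup instead of
// raw MPI calls; the function name and structure are illustrative.
void average_gradients(torch::nn::Module& model,
                       const c10::intrusive_ptr<c10d::ProcessGroupMPI>& pg) {
  for (auto& param : model.parameters()) {
    // allreduce takes a vector of tensors and returns an async Work handle.
    std::vector<at::Tensor> grads = {param.mutable_grad()};
    auto work = pg->allreduce(grads);
    work->wait();  // block until the collective completes
    param.mutable_grad() /= pg->getSize();
  }
}
```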

@soumyadipghosh (Contributor, Author)

@osalpekar I agree that we can just stick to MPI for this example, so I removed the NCCL code for now. Regarding the move to ProcessGroupMPI, I made some initial changes to the code based on the c10d example here, but it doesn't compile. I have a lot of questions, which I put as comments in the code; can you please clarify them?

@soumyadipghosh changed the title from "Distributed example on C++" to "Distributed example on C++ API (LibTorch)" on Jan 19, 2021
@osalpekar (Member) left a comment


Looks great - thanks for your work on this example @soumyadipghosh!

@soumyadipghosh (Contributor, Author) commented Jan 29, 2021

@osalpekar Thanks for all your help and thanks to @lasagnaphil for the patches! I have verified the build and everything works great! Ready for merging :)

@soumyadipghosh (Contributor, Author)

@osalpekar So how does this get merged?

@osalpekar merged commit 2e8b5c5 into pytorch:master on Feb 19, 2021