Added torch.distributed.launch module for easier multi-proc/node distributed job launching #5348

Merged
5 commits merged into pytorch:master on Mar 13, 2018

Conversation

teng-li
Contributor

@teng-li teng-li commented Feb 22, 2018

A helper module to launch multi-process distributed jobs on either a single node or multiple nodes.

$ python -m torch.distributed.launch --help
usage: launch.py [-h] [--num_node NUM_NODE] [--rank_node RANK_NODE]
                 [--nproc_per_node NPROC_PER_NODE] [--master_addr MASTER_ADDR]
                 [--master_port MASTER_PORT] [--dist_backend DIST_BACKEND]
                 training_script ...

PyTorch distributed training launch helper utility that will spawn up multiple
distributed processes

positional arguments:
  training_script       The full path to the single GPU training
                        program/script to be launched in parallel, followed by
                        all the arguments for the training script
  training_script_args

optional arguments:
  -h, --help            show this help message and exit
  --num_node NUM_NODE   The number of nodes to use for distributed training
  --rank_node RANK_NODE
                        The rank of the node for multi-node distributed
                        training
  --nproc_per_node NPROC_PER_NODE
                        The number of processes to launch on each node
  --master_addr MASTER_ADDR
                        Master node (rank 0)'s address; it should be either the
                        IP address or the hostname of node 0. For single-node
                        multi-proc training, --master_addr can simply be
                        127.0.0.1
  --master_port MASTER_PORT
                        Master node (rank 0)'s free port that needs to be used
                        for communication during distributed training

Can be used with:

pytorch/examples#306

For example, single-node multi-process training:

python -m torch.distributed.launch ./main.py -j 0 -a resnet18 --print-freq 1 --batch-size 32 --dist-url 'env://' /datasets01/imagenet_full_size/061417/ --epochs 1 --dist-backend 'nccl'

Multi-node multi-process training would be similar, using:

python -m torch.distributed.launch --num_node=2 --rank_node=0 --nproc_per_node=2 --master_addr=devfair033 ./main.py -j 0 -a resnet18 --print-freq 1 --batch-size 32 --dist-url 'env://' /datasets01/imagenet_full_size/061417/ --epochs 1 --dist-backend 'nccl'
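
For reference, here is a minimal sketch of a training script that this launcher could drive. The script name, model, and data handling below are illustrative and not part of this PR; the only contract assumed is the one described above: the launcher injects --device=X and the script initializes the process group with init_method='env://'.

# minimal_trainer.py (hypothetical example, not part of this PR)
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--device", type=int, default=0,
                    help="local device index injected by torch.distributed.launch")
args = parser.parse_args()

# The launcher is expected to export MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK,
# which is all the env:// init method needs.
dist.init_process_group(backend="nccl", init_method="env://")

torch.cuda.set_device(args.device)
model = torch.nn.Linear(10, 10).cuda(args.device)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.device])
# ... build a DataLoader with a DistributedSampler and run the usual training loop

Such a script would then be started with, for example, python -m torch.distributed.launch --nproc_per_node=2 minimal_trainer.py.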

Contributor

@apaszke apaszke left a comment

Have you verified that this can also be used like this:

python -m torch.utils.distributed.pytorch_dist_exec ...

In your case you have a development copy of PyTorch and know where all the files are located, but this is not true for users who e.g. download the binaries. They will have to use the command above. It would also be nice to include an example command in the docs, and to make the filename less verbose (e.g. shorten it to torch.distributed.start).

Helper function parsing the command line options
@retval ArgumentParser
"""
parser = ArgumentParser(description="PyTorch Exec is a helper utiliy that "

"training currently only supports the NCCL distributed backend. "
"This utilty helper will require that training script is able to "
"parse --device=X as an argument since it will be injected by this "
"utility. "

"127.0.0.1")
parser.add_argument("--master_port", default=29500, type=int,
help="Master node (rank 0)'s free port that needs to be used for "
"communciation in distributed training")

@soumith
Member

soumith commented Feb 22, 2018

  • What Adam said.
    Additionally, change the name to:
python -m torch.distributed.launch

@teng-li teng-li changed the title Added pytorch_dist_exec utiliy for easier distributed job launching Added torch.distributed.launch module for easier multi-proc/node distributed job launching Feb 22, 2018
@teng-li teng-li force-pushed the pytorch_exec branch 4 times, most recently from 4a1e644 to 1464ca4 Compare February 23, 2018 00:30
python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
--num_node=2 --rank_node=1 --master_addr="192.168.1.1"
--master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and
all other arguments of your training script)

module:

torch.distributed.init_process_group(backend='YOUR BACKEND',
"init_method='env://')

"init_method='env://')

(4) In your training program, you are supposed to convert your model to the
DistributedDataParallel module using the following function. Please ensure

device_ids=[arg.device])

(5) For multi-node training, we currently only support nodes with an identical
number of GPUs. In other words, the number of GPUs on each node needs to be the same.

"""

parser.add_argument("--rank_node", type=int, default=0,
help="The rank of the node for multi-node distributed "
"training")
parser.add_argument("--nproc_per_node", type=int, default=-1,

args = parse_args()
num_gpus = torch.cuda.device_count()
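
Continuing from the two lines just quoted, a minimal sketch of a GPU-count fallback might look like this; it is an illustration only, prompted by the -1 default seen in one fragment above (whether the merged code behaves this way is exactly what is debated later in the thread):

# Hypothetical fallback: launch one process per visible GPU when
# --nproc_per_node was not given (the -1 default seen in one fragment above);
# use a single process on CPU-only hosts.
if args.nproc_per_node < 0:
    args.nproc_per_node = num_gpus if num_gpus > 0 else 1

world_size = args.nproc_per_node * args.num_node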

parser.add_argument("--num_node", type=int, default=1,
help="The number of nodes to use for distributed "
"training")
parser.add_argument("--rank_node", type=int, default=0,

help="The rank of the node for multi-node distributed "
"training")
parser.add_argument("--nproc_per_node", type=int, default=1,
help="The number of processes to launch on each node, "

@teng-li
Contributor Author

teng-li commented Mar 6, 2018

@ngimel I deleted it because we would like this tool to work on CPU training as well.

@apaszke, mind taking another look?

@ngimel
Collaborator

ngimel commented Mar 6, 2018

@teng-li but the next line "will default to the number of GPUs on your system if not specified" is not correct, and the single-node example in line 25 won't do what users expect it to do. I don't mind it not being set (though to me it would still feel more natural if the launcher helper took care of using all available GPUs), but it should be reflected in the documentation and examples.

@ezyang
Contributor

ezyang commented Mar 6, 2018

@pytorchbot retest this please

3 similar comments

# spawn the processes
cmd = ["python",
args.training_script,
"--device={}".format(local_rank)] + args.training_script_args

@apaszke apaszke merged commit 37059ba into pytorch:master Mar 13, 2018
1. This utility and multi-process distributed (single-node or
multi-node) GPU training currently achieves the best performance only with
the NCCL distributed backend. Thus the NCCL backend is the recommended backend to
use for GPU training.
