Added torch.distributed.launch module for easier multi-proc/node distributed job launching #5348
Have you verified that this can also be used like this:
python -m torch.utils.distributed.pytorch_dist_exec ...
In your case you have a development copy of PyTorch and you know where all the files are located, but this is not true for users that e.g. download binaries. They will have to use the command above. It would also be nice to include an example command in the docs, and make the filename less verbose (e.g. shorten it to torch.distributed.start).
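An example of the kind of command being asked for, using the module name this PR later adopts and placeholder values (the GPU count and script name here are hypothetical, for illustration only):

python -m torch.distributed.launch --nproc_per_node=2 YOUR_TRAINING_SCRIPT.py --arg1 --arg2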
Helper function parsing the command line options
@retval ArgumentParser
"""
parser = ArgumentParser(description="PyTorch Exec is a helper utility that "
"training currently only supports the NCCL distributed backend. "
"This utility helper will require that the training script is able to "
"parse --device=X as an argument since it will be injected by this "
"utility. "
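To illustrate the requirement above, here is a minimal sketch of the training-script side; the exact parsing style is an assumption, only the --device flag itself comes from this PR:

from argparse import ArgumentParser

import torch

parser = ArgumentParser()
# --device=X is injected by the launch utility for each spawned process.
parser.add_argument("--device", type=int, default=0)
args, _ = parser.parse_known_args()

# Pin this process to its assigned GPU before building models or tensors.
torch.cuda.set_device(args.device)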
"127.0.0.1")
parser.add_argument("--master_port", default=29500, type=int,
                    help="Master node (rank 0)'s free port that needs to be used for "
                         "communication in distributed training")
python -m torch.distributed.launch
torch/distributed/launch.py
python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
       --num_node=2 --rank_node=1 --master_addr="192.168.1.1"
       --master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and
       all other arguments of your training script)
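Presumably the matching launch on the other node uses the same command with --rank_node=0, so that each node in the job gets a distinct node rank.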
torch/distributed/launch.py
module:

torch.distributed.init_process_group(backend='YOUR BACKEND',
                                     init_method='env://')
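As a concrete sketch of the initialization call in a training script, assuming the launcher makes MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE available in the environment (which the env:// method requires), and using NCCL purely as an example backend:

import torch.distributed as dist

# With init_method='env://', the address, port, rank and world size are
# read from environment variables rather than passed as arguments.
dist.init_process_group(backend='nccl', init_method='env://')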
torch/distributed/launch.py
init_method='env://')

(4) In your training program, you are supposed to convert your model to
a DistributedDataParallel module using the following function. Please ensure
torch/distributed/launch.py
device_ids=[arg.device])

(5) For multi-node training, currently we only support nodes with an identical number
of GPUs. In other words, the number of GPUs on each node needs to be the same.
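A minimal sketch of the initialization and conversion steps combined in a training script; the toy model and the argparse handling of --device are assumptions, and arg.device from the hunk above is written args.device here for consistency with a standard argparse namespace:

from argparse import ArgumentParser

import torch
import torch.distributed as dist
import torch.nn as nn

parser = ArgumentParser()
parser.add_argument("--device", type=int, default=0)  # injected by the launcher
args, _ = parser.parse_known_args()

# Assumes the launcher exported MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE.
dist.init_process_group(backend='nccl', init_method='env://')

# Step (4): build the model on the assigned GPU and wrap it so gradients
# are averaged across all launched processes.
torch.cuda.set_device(args.device)
model = nn.Linear(10, 10).cuda(args.device)
model = nn.parallel.DistributedDataParallel(model, device_ids=[args.device],
                                            output_device=args.device)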
torch/distributed/launch.py
parser.add_argument("--rank_node", type=int, default=0,
                    help="The rank of the node for multi-node distributed "
                         "training")
parser.add_argument("--nproc_per_node", type=int, default=-1,
torch/distributed/launch.py
args = parse_args()
num_gpus = torch.cuda.device_count()
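One of the hunks above gives --nproc_per_node a default of -1, and the review discussion below concerns whether it really falls back to the number of visible GPUs. A sketch of what such a fallback could look like, as an illustration of the behavior under discussion rather than what the final patch necessarily does (parse_args is the helper shown in the hunks above):

import torch

args = parse_args()
num_gpus = torch.cuda.device_count()

# If the user did not specify how many processes to launch per node,
# fall back to one process per visible GPU.
if args.nproc_per_node < 0:
    args.nproc_per_node = num_gpus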
torch/distributed/launch.py
parser.add_argument("--num_node", type=int, default=1,
                    help="The number of nodes to use for distributed "
                         "training")
parser.add_argument("--rank_node", type=int, default=0,
                    help="The rank of the node for multi-node distributed "
                         "training")
parser.add_argument("--nproc_per_node", type=int, default=1,
                    help="The number of processes to launch on each node, "
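For readability, the flags scattered across the hunks above can be collected into a single parse_args sketch. The defaults follow the hunks where they are visible, the description string is abbreviated, and the handling of the positional training-script arguments is an assumption based on args.training_script and args.training_script_args appearing later in the diff:

from argparse import REMAINDER, ArgumentParser

def parse_args():
    parser = ArgumentParser(description="PyTorch distributed training launch helper")
    parser.add_argument("--num_node", type=int, default=1,
                        help="The number of nodes to use for distributed training")
    parser.add_argument("--rank_node", type=int, default=0,
                        help="The rank of the node for multi-node distributed training")
    parser.add_argument("--nproc_per_node", type=int, default=1,
                        help="The number of processes to launch on each node")
    parser.add_argument("--master_addr", type=str, default="127.0.0.1",
                        help="Master node (rank 0)'s address")
    parser.add_argument("--master_port", type=int, default=29500,
                        help="Master node (rank 0)'s free port used for communication")
    # Everything after the training script path is forwarded to it untouched.
    parser.add_argument("training_script", type=str,
                        help="Path to the training script to launch")
    parser.add_argument("training_script_args", nargs=REMAINDER)
    return parser.parse_args()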
@teng-li but the next line "will default to the number of GPUs on your system if not specified" is not correct, and the single-node example in line 25 won't do what users expect it to do. I don't mind it not being set (though to me it would still feel more natural if the launcher helper took care of using all available GPUs), but it should be reflected in the documentation and examples.
@pytorchbot retest this please
3 similar comments
@pytorchbot retest this please
@pytorchbot retest this please
@pytorchbot retest this please
torch/distributed/launch.py
# spawn the processes
cmd = ["python",
       args.training_script,
       "--device={}".format(local_rank)] + args.training_script_args
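For context, here is a sketch of how a command list like the one above could be launched once per local process; the environment-variable names and the use of subprocess are assumptions for illustration and may differ from what this patch actually does:

import os
import subprocess
import sys

processes = []
for local_rank in range(args.nproc_per_node):
    # Global rank = node rank * processes per node + local rank.
    rank = args.rank_node * args.nproc_per_node + local_rank

    env = os.environ.copy()
    env["MASTER_ADDR"] = args.master_addr
    env["MASTER_PORT"] = str(args.master_port)
    env["WORLD_SIZE"] = str(args.num_node * args.nproc_per_node)
    env["RANK"] = str(rank)

    cmd = [sys.executable,
           args.training_script,
           "--device={}".format(local_rank)] + args.training_script_args
    processes.append(subprocess.Popen(cmd, env=env))

# Wait for every worker so the launcher exits only when training is done.
for process in processes:
    process.wait()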
1. This utility and multi-process distributed (single-node or
multi-node) GPU training currently achieves the best performance only with
the NCCL distributed backend. Thus the NCCL backend is the recommended backend to
use for GPU training.
A helper module to launch multi-process distributed jobs, either on a single node or on multiple nodes.
It can be used with pytorch/examples#306.
For example, single-node multi-process training is launched as shown in the sketch below; multi-node multi-process training is similar, using the --num_node, --rank_node, --master_addr and --master_port flags documented above.
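The following invocations are assembled from the flags visible in this PR's diff rather than taken verbatim from the original description, so treat them as a sketch; NUM_GPUS_YOU_HAVE and YOUR_TRAINING_SCRIPT.py are placeholders.

Single-node multi-process training:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
       YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 and all other arguments of your training script)

Two-node training, run once on each node with --rank_node=0 on the first node and --rank_node=1 on the second:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
       --num_node=2 --rank_node=0 --master_addr="192.168.1.1" --master_port=1234
       YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 and all other arguments of your training script)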