
[ROCm] Increasing the rpc timeout #894


Merged
merged 1 commit into pytorch:master on Mar 22, 2021

Conversation

pruthvistony
Contributor

  • Higher timeout is required when running on ROCm
    as the required kernels are compiled at runtime.

@pruthvistony
Contributor Author

pruthvistony commented Mar 17, 2021

@jeffdaily @sunway13
Please review

@pruthvistony changed the title from "Increasing the rpc timeout" to "[ROCm] Increasing the rpc timeout" on Mar 17, 2021
@pruthvistony
Contributor Author

@pritamdamania, @mrshenli, @vishwakftw
Can you please review these changes?

@pritamdamania87
Contributor

@pruthvistony I guess the default timeout is 60s, does it actually take > 60s to compile the kernels? I was wondering if you could point us to the line which actually times out with the defaults? I just wanted to double check if there wasn't something else that might be causing this.
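For reference, a minimal sketch of where the 60 s default mentioned here comes from, assuming a recent torch.distributed.rpc build; the constant name follows the RPC documentation:

# Sketch only: inspect the default RPC timeout used by TensorPipeRpcBackendOptions.
# Assumes torch.distributed.rpc is available; constant name per the RPC docs.
from torch.distributed.rpc.constants import DEFAULT_RPC_TIMEOUT_SEC

print(DEFAULT_RPC_TIMEOUT_SEC)  # 60.0 seconds, i.e. the 60000 ms in the error below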

@pruthvistony
Contributor Author

@pritamdamania87,
I get the following execution logs:
root@ixt-hw-01:/var/lib/jenkins/examples/distributed/rpc/pipeline# python main.py
Processing batch 0
[W tensorpipe_agent.cpp:546] RPC agent for worker1 encountered error when reading incoming request from master: EOF: end of file (this is expected to happen during shutdown)
[W tensorpipe_agent.cpp:546] RPC agent for worker2 encountered error when reading incoming request from master: EOF: end of file (this is expected to happen during shutdown)
[W tensorpipe_agent.cpp:546] RPC agent for worker2 encountered error when reading incoming request from worker1: EOF: end of file (this is expected to happen during shutdown)
Traceback (most recent call last):
  File "main.py", line 249, in
    mp.spawn(run_worker, args=(world_size, num_split), nprocs=world_size, join=True)
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/var/lib/jenkins/examples/distributed/rpc/pipeline/main.py", line 231, in run_worker
    run_master(num_split)
  File "/var/lib/jenkins/examples/distributed/rpc/pipeline/main.py", line 214, in run_master
    outputs = model(inputs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 879, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/var/lib/jenkins/examples/distributed/rpc/pipeline/main.py", line 169, in forward
    return torch.cat(torch.futures.wait_all(out_futures))
  File "/opt/conda/lib/python3.6/site-packages/torch/futures/__init__.py", line 196, in wait_all
    return [fut.wait() for fut in torch._C._collect_all(cast(List[torch._C.Future], futures)).wait()]
RuntimeError: RPCErr:1:RPC ran for more than set timeout (60000 ms) and will now be marked with an error

It breaks in rpc_async() while processing ResNetShard2, which times out.
On ROCm, with an increased timeout, once the test case has run the compiled kernels are stored in the cache, and every subsequent execution works properly with the 60 sec timeout. I experimented with different timeouts and found that > 300 sec works for ROCm. I believe that on CUDA all/most of the kernels are available at the start, so the 60 sec timeout is sufficient.
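A minimal sketch of the change being described, assuming the standard torch.distributed.rpc API used by the pipeline example; init_worker is a hypothetical helper, and the example's worker naming and shutdown are omitted:

# Sketch of the fix described above, not the exact example code.
import os
import torch.distributed.rpc as rpc

def init_worker(name, rank, world_size):
    # Standard single-node rendezvous, as in the pipeline example.
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '29500'
    # Raise the per-RPC timeout from the 60 s default so that first-run
    # ROCm kernel compilation can finish before the call is marked failed.
    options = rpc.TensorPipeRpcBackendOptions(
        num_worker_threads=256,
        rpc_timeout=300,  # seconds; the discussion above found ~300 s sufficient on ROCm
    )
    rpc.init_rpc(name, rank=rank, world_size=world_size,
                 rpc_backend_options=options)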

@@ -219,7 +219,7 @@ def run_master(split_size):
 def run_worker(rank, world_size, num_split):
     os.environ['MASTER_ADDR'] = 'localhost'
     os.environ['MASTER_PORT'] = '29500'
-    options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=256)
+    options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=256, rpc_timeout=300)
Contributor

Could you add a comment here explaining why we need the 300s timeout?

Contributor Author

Updated the commit with comments to explain the change.
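The exact committed comment text is not shown in this thread; a sketch of the kind of inline note being requested might read:

# Sketch only, not the exact committed text:
# On ROCm the required kernels are compiled at runtime, so the first run can
# exceed the default 60 s RPC timeout; rpc_timeout=300 gives compilation enough
# headroom, and later runs reuse the cached kernels.
options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=256, rpc_timeout=300)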

- Higher timeout is required when running on ROCm
  as the required kernels are compiled at runtime.
@pritamdamania87 merged commit af11138 into pytorch:master on Mar 22, 2021