
[ROCm] Increasing the rpc timeout #894


Merged
merged 1 commit into pytorch:master on Mar 22, 2021

Conversation

pruthvistony
Contributor

  • Higher timeout is required when running on ROCm
    as the required kernels are compiled at runtime.

@pruthvistony
Contributor Author

pruthvistony commented Mar 17, 2021

@jeffdaily @sunway13
Please review

@pruthvistony changed the title from "Increasing the rpc timeout" to "[ROCm] Increasing the rpc timeout" on Mar 17, 2021
@pruthvistony
Contributor Author

@pritamdamania, @mrshenli, @vishwakftw
Can you please review these changes?

@pritamdamania87
Contributor

@pruthvistony I guess the default timeout is 60s, does it actually take > 60s to compile the kernels? I was wondering if you could point us to the line which actually times out with the defaults? I just wanted to double check if there wasn't something else that might be causing this.
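For reference, a minimal sketch of where the 60 s default mentioned here comes from, assuming a recent torch.distributed.rpc build; the constant name follows the RPC documentation:

# Sketch only: inspect the default RPC timeout used by TensorPipeRpcBackendOptions.
# Assumes torch.distributed.rpc is available; constant name per the RPC docs.
from torch.distributed.rpc.constants import DEFAULT_RPC_TIMEOUT_SEC

print(DEFAULT_RPC_TIMEOUT_SEC)  # 60.0 seconds, i.e. the 60000 ms in the error below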

@pruthvistony
Contributor Author

@pritamdamania87,
I get the following execution logs:
root@ixt-hw-01:/var/lib/jenkins/examples/distributed/rpc/pipeline# python main.py
Processing batch 0
[W tensorpipe_agent.cpp:546] RPC agent for worker1 encountered error when reading incoming request from master: EOF: end of file (this is expected to happen during shutdown)
[W tensorpipe_agent.cpp:546] RPC agent for worker2 encountered error when reading incoming request from master: EOF: end of file (this is expected to happen during shutdown)
[W tensorpipe_agent.cpp:546] RPC agent for worker2 encountered error when reading incoming request from worker1: EOF: end of file (this is expected to happen during shutdown)
Traceback (most recent call last):
  File "main.py", line 249, in
    mp.spawn(run_worker, args=(world_size, num_split), nprocs=world_size, join=True)
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/var/lib/jenkins/examples/distributed/rpc/pipeline/main.py", line 231, in run_worker
    run_master(num_split)
  File "/var/lib/jenkins/examples/distributed/rpc/pipeline/main.py", line 214, in run_master
    outputs = model(inputs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 879, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/var/lib/jenkins/examples/distributed/rpc/pipeline/main.py", line 169, in forward
    return torch.cat(torch.futures.wait_all(out_futures))
  File "/opt/conda/lib/python3.6/site-packages/torch/futures/__init__.py", line 196, in wait_all
    return [fut.wait() for fut in torch._C._collect_all(cast(List[torch._C.Future], futures)).wait()]
RuntimeError: RPCErr:1:RPC ran for more than set timeout (60000 ms) and will now be marked with an error

It breaks in rpc_async() while processing ResNetShard2, which times out.
On ROCm, with an increased timeout, once the test case has run the compiled kernels are stored in the cache, and every subsequent execution works properly with the 60 sec timeout. I experimented with different timeouts and found that > 300 sec works for ROCm. I believe that on CUDA all/most of the kernels are available at the start, so the 60 sec timeout is sufficient.
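A minimal sketch of the change being described, assuming the standard torch.distributed.rpc API used by the pipeline example; init_worker is a hypothetical helper, and the example's worker naming and shutdown are omitted:

# Sketch of the fix described above, not the exact example code.
import os
import torch.distributed.rpc as rpc

def init_worker(name, rank, world_size):
    # Standard single-node rendezvous, as in the pipeline example.
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '29500'
    # Raise the per-RPC timeout from the 60 s default so that first-run
    # ROCm kernel compilation can finish before the call is marked failed.
    options = rpc.TensorPipeRpcBackendOptions(
        num_worker_threads=256,
        rpc_timeout=300,  # seconds; the discussion above found ~300 s sufficient on ROCm
    )
    rpc.init_rpc(name, rank=rank, world_size=world_size,
                 rpc_backend_options=options)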

@@ -219,7 +219,7 @@ def run_master(split_size):
 def run_worker(rank, world_size, num_split):
     os.environ['MASTER_ADDR'] = 'localhost'
     os.environ['MASTER_PORT'] = '29500'
-    options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=256)
+    options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=256, rpc_timeout=300)
Contributor

Could you add a comment here explaining why we need the 300s timeout?

Contributor Author

Updated the commit with comments to explain the change.
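The exact committed comment text is not shown in this thread; a sketch of the kind of inline note being requested might read:

# Sketch only, not the exact committed text:
# On ROCm the required kernels are compiled at runtime, so the first run can
# exceed the default 60 s RPC timeout; rpc_timeout=300 gives compilation enough
# headroom, and later runs reuse the cached kernels.
options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=256, rpc_timeout=300)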

- Higher timeout is required when running on ROCm
  as the required kernels are compiled at runtime.
@pritamdamania87 merged commit af11138 into pytorch:master on Mar 22, 2021