[ROCm] Increasing the rpc timeout #894
@jeffdaily @sunway13
@pritamdamania, @mrshenli, @vishwakftw
@pruthvistony I believe the default timeout is 60s. Does it actually take more than 60s to compile the kernels? Could you point us to the line that times out with the defaults? I just want to double-check that nothing else is causing this.
@pritamdamania87, the run fails with "-- Process 0 terminated with the following error:". Breaking in shows that rpc_async() times out while processing ResNetShard2.
@@ -219,7 +219,7 @@ def run_master(split_size):
 def run_worker(rank, world_size, num_split):
     os.environ['MASTER_ADDR'] = 'localhost'
     os.environ['MASTER_PORT'] = '29500'
-    options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=256)
+    options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=256, rpc_timeout=300)
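For illustration only, the longer timeout could also be applied just on ROCm builds instead of unconditionally. This is a minimal sketch, not what this PR does: the helper names `pick_rpc_timeout` and `make_backend_options` are hypothetical, the 60s figure is the default mentioned in the discussion, and the ROCm check assumes `torch.version.hip` is non-None on ROCm builds of PyTorch.

```python
# Hypothetical helpers; names are illustrative and not part of this PR.
DEFAULT_RPC_TIMEOUT_S = 60.0   # default RPC timeout mentioned in the discussion
ROCM_RPC_TIMEOUT_S = 300.0     # value introduced by this PR


def pick_rpc_timeout(is_rocm: bool) -> float:
    """Return the RPC timeout in seconds for the current platform."""
    return ROCM_RPC_TIMEOUT_S if is_rocm else DEFAULT_RPC_TIMEOUT_S


def make_backend_options(num_worker_threads: int = 256):
    """Build TensorPipe options with a platform-dependent timeout.

    torch is imported lazily so pick_rpc_timeout stays usable without it.
    """
    import torch
    from torch.distributed import rpc

    # Assumption: torch.version.hip is a version string on ROCm builds
    # and None otherwise.
    is_rocm = torch.version.hip is not None
    return rpc.TensorPipeRpcBackendOptions(
        num_worker_threads=num_worker_threads,
        rpc_timeout=pick_rpc_timeout(is_rocm),
    )
```

In `run_worker`, `options = make_backend_options()` would then replace the hard-coded constructor call, leaving the default timeout in place on non-ROCm builds.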
Could you add a comment here explaining why we need the 300s timeout?
Updated the commit with comments to explain the change.
- Higher timeout is required when running on ROCm as the required kernels are compiled at runtime.
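The effect described in the commit message can be reproduced in miniature without a GPU. The sketch below is a simulation using only the standard library, not the RPC framework itself: `run_kernel` is a hypothetical stand-in whose first call per kernel name pays a one-time "compile" delay, and the delays are scaled down to fractions of a second.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_compiled = set()  # names of kernels that have already been "compiled"


def run_kernel(name, compile_s=0.5):
    """Simulated kernel launch: the first call per name pays a compile delay."""
    if name not in _compiled:
        time.sleep(compile_s)  # stands in for runtime kernel compilation
        _compiled.add(name)
    return f"{name} done"


def call_with_timeout(pool, name, timeout_s):
    """Submit the call and wait up to timeout_s; return None on timeout."""
    future = pool.submit(run_kernel, name)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return None


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Cold call with a short timeout: the caller gives up before the
        # compile delay has elapsed.
        print(call_with_timeout(pool, "conv", timeout_s=0.1))
        # Generous timeout: the call completes despite the compile delay.
        print(call_with_timeout(pool, "relu", timeout_s=5.0))
        # Warm call: no compile delay, so the short timeout is enough.
        print(call_with_timeout(pool, "relu", timeout_s=0.2))
```

The same call that fails under the short timeout succeeds once the kernel is warm, which is why only the first pass needs the larger budget.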
b42e753 to 9f9cd44