
Request fixes #1716

Merged
merged 3 commits into from
May 26, 2016

Conversation

hjelmn
Member

@hjelmn hjelmn commented May 25, 2016

No description provided.

hjelmn added 2 commits May 25, 2016 15:34
This fixes an error when building with --enable-static.

Signed-off-by: Nathan Hjelm <[email protected]>

This fixes a hang caused by the request refactor work. The cm pml was
not updated and was hanging in most cases.

Signed-off-by: Nathan Hjelm <[email protected]>

@hjelmn
Member Author

hjelmn commented May 25, 2016

@bosilca, @jladd-mlnx Two problems: one causes a compilation failure and the other causes hangs.

@hjelmn
Member Author

hjelmn commented May 25, 2016

Hmm, yalla isn't hanging but needs an update as well. Incoming.

This commit brings the pml/yalla component up to date with the request
rework changes.

Signed-off-by: Nathan Hjelm <[email protected]>
@jladd-mlnx
Member

@hjelmn Just so I understand: the root cause was a lack of support in the CM PML. The combination of OSHMEM with the CM PML, the MXM MTL, and the IKRIT SPML just happened to trigger the hang; OSHMEM itself was not involved. Correct?

@hjelmn
Member Author

hjelmn commented May 25, 2016

Looks like it.

@hjelmn
Member Author

hjelmn commented May 25, 2016

Need jenkins to confirm.

@hjelmn
Member Author

hjelmn commented May 25, 2016

@bosilca I still see a lot of references to ompi_request_lock in ompi. I think most of them can go away. Is there a reason those were left?

@bosilca
Member

bosilca commented May 25, 2016

Wherever the request lock protects request_complete, it is no longer necessary. @thananon started to remove those uses but apparently stopped after ob1.
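
For readers following the point above, here is a minimal, generic sketch of how a completion flag can be published without a global lock. It is an illustration only and is not Open MPI code: the type `sketch_request_t` and every `sketch_*` function are hypothetical names, and the thread does not say that the request rework uses exactly this pattern. The assumption is simply that completion is signalled through a C11 atomic flag with release/acquire ordering, so a waiter that observes the flag also observes the status written before it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical request type for illustration; NOT Open MPI's ompi_request_t. */
typedef struct {
    int status;            /* payload written before the request is completed */
    atomic_bool complete;  /* completion flag, published with release ordering */
} sketch_request_t;

static void sketch_request_init(sketch_request_t *req)
{
    req->status = 0;
    atomic_init(&req->complete, false);
}

/* Completer side: write the payload first, then publish the flag.
 * The release store orders the status write before the flag becomes visible. */
static void sketch_request_complete(sketch_request_t *req, int status)
{
    req->status = status;
    atomic_store_explicit(&req->complete, true, memory_order_release);
}

/* Waiter side: the acquire load pairs with the release store above,
 * so no mutex is needed just to read the flag safely. */
static bool sketch_request_is_complete(sketch_request_t *req)
{
    return atomic_load_explicit(&req->complete, memory_order_acquire);
}

/* Stand-in for a progress thread that completes the request. */
static void *sketch_progress_thread(void *arg)
{
    sketch_request_complete((sketch_request_t *)arg, 42);
    return NULL;
}

/* Spawns the completer, spins until the flag is set, returns the status. */
static int sketch_run_demo(void)
{
    sketch_request_t req;
    pthread_t tid;

    sketch_request_init(&req);
    int rc = pthread_create(&tid, NULL, sketch_progress_thread, &req);
    assert(rc == 0);
    (void)rc;

    while (!sketch_request_is_complete(&req)) {
        /* busy-wait; a real waiter would drive progress or block here */
    }
    int status = req.status;  /* visible because of the acquire load */

    pthread_join(tid, NULL);
    return status;
}
```

Calling `sketch_run_demo()` from a `main` (built with `-pthread`) returns the status the completer wrote. A lock would still be needed for any state that is not covered by the flag, which is why each remaining use has to be checked individually.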

@hjelmn
Member Author

hjelmn commented May 25, 2016

@thananon should look at the ones in pml/ucx and see if they are still relevant. I took care of yalla and cm.

@hjelmn
Member Author

hjelmn commented May 25, 2016

Unrelated error? "error posting send request error 12"

@hjelmn
Member Author

hjelmn commented May 25, 2016

:bot:retest:

@bosilca
Member

bosilca commented May 25, 2016

The second part of the error is more interesting: "Cannot allocate memory. size = 14"

@jladd-mlnx
Member

jladd-mlnx commented May 25, 2016

It's an OpenIB error on PML OB1. It's a thread multiple test.

taskset -c 6,7 timeout -s SIGSEGV 10m /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-2/ompi_install1/bin/mpirun -np 2 -bind-to core -mca btl_openib_if_include mlx4_0:1 -x MXM_RDMA_PORTS=mlx4_0:1 -x UCX_NET_DEVICES=mlx4_0:1 -x UCX_TLS=rc,cm -mca pml ob1 -mca btl self,openib -mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-2/ompi_install1/thread_tests/thread-tests-1.1/latency_th 8
01:29:40 Size (bytes)    Time (us)
01:29:40 [jenkins01][[61884,1],0][btl_openib_endpoint.c:115:mca_btl_openib_endpoint_post_send] error posting send request error 12: Cannot allocate memory. size = 14
01:29:40 
01:29:40 [jenkins01][[61884,1],0][btl_openib_endpoint.c:115:mca_btl_openib_endpoint_post_send] error posting send request error 12: Cannot allocate memory. size = 14
01:29:40 
01:29:40 [jenkins01][[61884,1],0][btl_openib_endpoint.c:115:mca_btl_openib_endpoint_post_send] error posting send request error 12: Cannot allocate memory. size = 14
01:29:40 
01:29:40 *** glibc detected *** /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-2/ompi_install1/thread_tests/thread-tests-1.1/latency_th: munmap_chunk(): invalid pointer: 0x00007fff9c0058a0 ***
01:29:40 [jenkins01][[61884,1],0][btl_openib_endpoint.c:115:mca_btl_openib_endpoint_post_send] error posting send request error 12: Cannot allocate memory. size = 14
01:29:40

@hjelmn
Member Author

hjelmn commented May 25, 2016

@jladd-mlnx Is that a jenkins machine failure or legitimate?

@jladd-mlnx
Member

jladd-mlnx commented May 25, 2016

@hjelmn Something random. It ran once, and now it's failing repeatedly. The OSHMEM command line works, however. Jenkins is in fine health. This is a legitimate failure.

@jladd-mlnx
Member

jladd-mlnx commented May 25, 2016

@bosilca @hjelmn The failure doesn't reproduce with the CM PML, but that run grinds very slowly. It seems to work beautifully with the Yalla PML: fast and safe. Only OB1 fails, almost always (but not always), with the weird, random allocation error. Looks like something isn't thread safe.

@jladd-mlnx
Member

@bosilca @hjelmn It's an XRC failure.

@hjelmn
Member Author

hjelmn commented May 25, 2016

@jladd-mlnx General XRC failure or openib btl XRC failure?

@jladd-mlnx
Member

@hjelmn OpenIB BTL XRC failure.

@hjelmn
Member Author

hjelmn commented May 26, 2016

Well, since this is a different failure I will merge this. Will have to dissect the other failure tomorrow.

@hjelmn hjelmn merged commit 5d32217 into open-mpi:master May 26, 2016

@jsquyres
Member

@hjelmn I'm afraid @thananon doesn't know anything about UCX. Who's available from the UCX side who can help here?

@bosilca
Member

bosilca commented May 26, 2016

I might be able to get to it either Friday afternoon, or early next week.

@thananon
Member

@jsquyres @bosilca I think I can get to it today. I just have a few things to make sure about UCX. I will PM you guys for advice.

@thananon
Member

#1719

5 participants