Refactor the request completion #1422
Conversation
@@ -92,7 +90,6 @@ static int mca_pml_ob1_recv_request_free(struct ompi_request_t** request)
        MCA_PML_OB1_RECV_REQUEST_RETURN( recvreq );
    }
why are the calls to OPAL_THREAD_LOCK removed?
Because there is no locking anymore. Everything related to the synchronization is now handled down in the sync_wait, via atomic operations.
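For illustration, here is a minimal sketch in C11 atomics of what such a lock-free completion path can look like (the names request_t, wait_sync_t, and request_complete are hypothetical, not the actual ob1/opal symbols): the completing thread publishes the completion with a single atomic exchange and signals only the thread, if any, that registered a sync object on that request.

/* Minimal sketch (hypothetical names, not the actual ob1/opal code):
 * req_complete holds PENDING, COMPLETED, or a pointer to the sync object
 * of the single thread currently waiting on this request. */
#include <stdatomic.h>

#define REQUEST_PENDING   ((void *) 0L)
#define REQUEST_COMPLETED ((void *) 1L)

typedef struct {
    _Atomic int count;              /* requests the waiter is still missing */
} wait_sync_t;

typedef struct {
    _Atomic(void *) req_complete;   /* PENDING, COMPLETED, or wait_sync_t * */
} request_t;

/* Called from the progress engine when a request finishes. */
static void request_complete(request_t *req)
{
    /* Publish the completion with one atomic exchange -- no global lock. */
    void *prev = atomic_exchange(&req->req_complete, REQUEST_COMPLETED);
    if (REQUEST_PENDING != prev) {
        /* Exactly one thread registered a sync object on this request:
         * signal only that thread (the real code also wakes it via a
         * condition variable inside the sync object). */
        atomic_fetch_sub(&((wait_sync_t *) prev)->count, 1);
    }
}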
bot:retest
@jsquyres I think this is too risky for 2.0.0
@bosilca please have your student rebase down to a few commits.
Performance tests show an almost 400% message rate increase over vader with 4 threads per process. Increasing the number of threads only increases this performance boost.
@hppritcha I agree that this is too big for v2.0.0 -- I think v2.0.1 at the earliest. @bosilca What's the merge commit in the middle for? That does not seem correct at all. Also, it would be really good to clean up this patch set a bunch -- there are a whole lot of intermediate commits (e.g., "foo" is added in commit A, but then removed in commit B). Can this be squashed to a smaller number of logical commits? (It probably doesn't make sense to squash it down to 1 commit, but something less than 30 would be good...)
@bosilca tells me that this PR is not yet ready -- they're still working on the code. For the moment, I'm deferring this to v2.1.0.
@bosilca Does this PR represent all the commits that went into master? The merge commit near the bottom of the list seems weird / wrong...
Added the wait sync primitive and integrated it into the PML and MTL infrastructure. The multi-threaded requests are now significantly less heavy and less noisy (only the threads associated with completed requests are signaled).
bot:retest
@Di0gen @jladd-mlnx UCX builds still seem to be broken in the Mellanox Jenkins.
Per discussion on the 2015-05-24 webex, we decided:
@bosilca After the webex conversation, however, I see that the Mellanox Jenkins failed an OSHMEM test with this PR. It looks like a legit failure -- can you please investigate?
@jsquyres Looking at the Jenkins failure indicates the issue is in the RTE, more precisely in the OOB. Here is the only usable stack I can get from the Mellanox output.
Something is very off here, guys - look at the stack trace. The error is in oshrun, not in an application process. Unless Mellanox has done something very odd, there is no way that a change in the MPI request can cause oshrun to fail.
@rhc54 The question is whether the application processes have started or not. They certainly haven't produced any output.
@rhc54 BTW, oshrun crashed because the Mellanox Jenkins sends a SEGV on timeout.
Oh, wow -- we've been chasing a SEGV, but the SEGV was coming from Mellanox's script itself (note the parameter to the "timeout" command):
So it looks like the real issue is a hang.
@jladd-mlnx Yep -- fingers crossed! Can we change the timeout signal to something other than SEGV? That misled us for a while today; it might be helpful in the future to not have to remember that the SEGV isn't actually a SEGV. Thanks!
@jsquyres - the reason we send SEGV is to make mpirun dump the stack trace.
SIGABRT?
SIGABRT is not interceptable by mpirun and will just kill it without a stack trace.
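As background for that exchange, here is a minimal, self-contained sketch (not Open MPI's or the Jenkins script's actual code) of how a process can turn a delivered SIGSEGV into a stack dump, which is roughly the behavior the scripted SEGV relies on:

/* Minimal sketch: catch SIGSEGV and write a backtrace to stderr before
 * exiting, using glibc's execinfo API. Run it and send `kill -SEGV <pid>`
 * from another terminal to see the dump. */
#include <execinfo.h>
#include <signal.h>
#include <unistd.h>

static void segv_handler(int sig)
{
    void *frames[64];
    int depth = backtrace(frames, 64);                   /* capture the call stack */
    backtrace_symbols_fd(frames, depth, STDERR_FILENO);  /* dump without allocating */
    _exit(128 + sig);
}

int main(void)
{
    signal(SIGSEGV, segv_handler);
    pause();   /* idle until a signal arrives */
    return 0;
}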
@jladd-mlnx Would it be better to grab a stack trace with something like
@jladd-mlnx Oops -- I meant to tag the prior comment to both @jladd-mlnx and @miked-mellanox. I'm apparently still not awake yet. Sorry!
The problem is how to detect a hang. mpirun has a progress thread which can dump the stack after a user-provided time limit. With your proposal, one needs to develop a watchdog client and integrate it - that is very error prone IMO.
We currently have no plans to change the script.
For the public record, there were several inaccurate and misleading comments posted to this ticket. OSHMEM has nothing to do with this failure. A hang in the CM PML was triggered during an OSHMEM hello world test because OSHMEM init uses MPI collectives and, hence, the PML to bootstrap itself once MPI is init-ed. The root cause of the hang was a lack of support in the CM PML for the new request handling. The combination of OSHMEM with the CM PML, MXM MTL, and IKRIT SPML just happened to trigger the hang. A fix for this issue was made in #1716; however, subsequent testing has exposed another bug in the OpenIB BTL that needs to be fixed as well. Thanks to Mellanox Jenkins for catching this.
Just a little further clarification: part of the reason for the confusion was the use of an induced segfault in the Mellanox Jenkins script to initiate a timeout shutdown of oshrun. This provides a false positive in a portion of the code totally unrelated to the submitted PR, leaving the impression that the test, and not the code, was the root cause. It appears we are unable to reach agreement on how best to avoid this going forward. It therefore remains a reasonable possibility that this problem will repeat in the future, pending agreement on a better solution.
The resolution going forward is the following; any community member is more than welcome to access the Mellanox Jenkins server and is encouraged to do so. All that I need is your public RSA key, and I will be happy to add it to the list of authorized keys. Once added, you will simply need to:
And then to the Jenkins server:
This takes Mellanox out of the critical path of other organizations' projects and road maps.
Good to know, but not likely to be useful, to be honest. Many of us are in the same situation as you and don't have the time to chase apparently unrelated issues, and there are legal concerns for some members regarding accessing Mellanox machines. So I don't think that will provide a practical solution. Probably best for the community to discuss this offline.
The offer stands. It's the best we can do.
@miked-mellanox Probably using a tool like PADB (http://padb.pittman.org.uk/) to collect the stack trace after waiting for a timeout period is one option. I have used the code from https://code.google.com/archive/p/padb/source and that seems to work.
Yep. This tool does not fit the Jenkins framework. One needs to write complicated logic to detect the mpirun hang, fork padb to extract the stack, and properly communicate status to Jenkins.
Yeah, I'm not sure it's worth all that effort. The problem we really need to solve is to tell us that the test hung - getting a stack trace from mpirun won't tell us much about why, as mpirun isn't the source of the problem. I think a simple error statement that the test timed out, like we do for MTT, would be more useful and create less "false positive" confusion.
Yep.
Yeah, we should think about this a bit. What we really want to know is that we timed out because the applications never terminated, as opposed to we timed out because of a problem in mpirun itself. The problem with just generating an mpirun stack trace is that it won't tell you anything if the problem is in the application. All it can show is that mpirun is merrily waiting for something to happen. Perhaps the better solution is to provide some way of asking mpirun to output a state report - i.e., what does it think the current state of the job is? Does it think all procs terminated and it is just hung trying to exit? Or does it think procs are still running? Another thing we could get is a simple "ps" output. Are there procs still running? Or have all the apps exited? This should be simple to get, and would provide more info than a stack trace. If we can do some of that, it would be helpful to avoid the confusion.
Yep, something like that would be very helpful to all. Achieving the same effect with Jenkins and external watchdog processes will make things worse, because it is hard to orchestrate inside a single Jenkins process.
% mpirun -dump-stack-from-all-ranks-and-exit-with-error-after-time 10m imb.exe
Might be doable - I can take a crack at it.
@rhc54 Perhaps this is something you could daisy chain off the $MPIEXEC_TIMEOUT functionality...?
😆 that is exactly what I am coding right this second
What do you think of this:
$ mpirun -n 4 --map-by node -timeout 2 --report-state-on-timeout ./mpi_spin
--------------------------------------------------------------------------
The user-provided time limit for job execution has been
reached:
MPIEXEC_TIMEOUT: NULL seconds
The job will now be aborted. Please check your code and/or
adjust/remove the job execution time limit (as specified
by MPIEXEC_TIMEOUT in your environment or -timeout on the cmd line).
--------------------------------------------------------------------------
DATA FOR JOB: [42410,0]
Num apps: 1 Num procs: 2 JobState: ALL DAEMONS REPORTED Abort: False
Num launched: 0 Num reported: 2 Num terminated: 0
Procs:
Rank: 0 Node: bend001 PID: 10484 State: RUNNING ExitCode 0
Rank: 1 Node: bend003 PID: 10489 State: RUNNING ExitCode 0
DATA FOR JOB: [42410,1]
Num apps: 1 Num procs: 4 JobState: SYNC REGISTERED Abort: False
Num launched: 4 Num reported: 4 Num terminated: 0
Procs:
Rank: 0 Node: bend001 PID: 10490 State: SYNC REGISTERED ExitCode 0
Rank: 1 Node: bend003 PID: 28765 State: SYNC REGISTERED ExitCode 0
Rank: 2 Node: bend001 PID: 10491 State: SYNC REGISTERED ExitCode 0
Rank: 3 Node: bend003 PID: 28766 State: SYNC REGISTERED ExitCode 0
* Remodel the request. Added the wait sync primitive and integrate it into the PML and MTL infrastructure. The multi-threaded requests are now significantly less heavy and less noisy (only the threads associated with completed requests are signaled). * Fix the condition to release the request. (cherry picked from open-mpi/ompi@b90c838) Signed-off-by: Nathan Hjelm <[email protected]>
The current code to handle request completions (for both single- and multi-request calls) is complex and highly inefficient. It has a big impact on our message rate for single-threaded applications, and prevents good performance in multi-threaded cases (a global lock for completion, waking all threads upon each request completion, and many more issues).
This patch fixes all cases. First, it refactors the completion code and the link between completion and opal_progress. It introduces a new synchronization primitive, and the completion status of a request is now either COMPLETE, PENDING, or a pointer to that synchronization primitive. As a result, the request status is handled via atomic operations: at least one per request (to mark it as completed) and at most two (one to attach the synchronization primitive and one to mark it as complete). Second, it refactors all wait* and test* functions to reduce the number of iterations over the array of requests (if any).
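To make the description above concrete, here is a rough sketch in C11 atomics of the waiter side of such a protocol (hypothetical names, not the actual ompi/opal implementation). Each request is touched by at most two atomic operations -- one compare-and-swap to attach the waiter's sync object and one exchange, on the completing side, to mark it COMPLETE -- and the waiter walks the request array only once.

/* Rough sketch of the waiter side (hypothetical names, not the actual
 * ompi/opal code). req_complete is PENDING, COMPLETED, or a pointer to
 * the waiter's sync object. */
#include <stdatomic.h>

#define REQUEST_PENDING   ((void *) 0L)
#define REQUEST_COMPLETED ((void *) 1L)

typedef struct {
    _Atomic int count;                /* requests this waiter still needs */
} wait_sync_t;

typedef struct {
    _Atomic(void *) req_complete;     /* PENDING, COMPLETED, or wait_sync_t * */
} request_t;

/* Wait for all 'n' requests with a single pass over the array. */
static void wait_all(request_t **reqs, int n)
{
    wait_sync_t sync;
    atomic_init(&sync.count, n);

    for (int i = 0; i < n; i++) {
        void *expected = REQUEST_PENDING;
        /* Try to attach our sync object; if the CAS fails the request has
         * already completed, so account for it locally instead. */
        if (!atomic_compare_exchange_strong(&reqs[i]->req_complete,
                                            &expected, (void *) &sync)) {
            atomic_fetch_sub(&sync.count, 1);
        }
    }

    /* Block until every completing thread has decremented the count (the
     * real code sleeps on a condition variable and drives opal_progress
     * instead of spinning). */
    while (atomic_load(&sync.count) > 0) {
        /* spin; placeholder for opal_progress() / sync_wait() */
    }
}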