Refactor the request completion #1422

Merged
merged 3 commits into from May 24, 2016

Conversation

bosilca
Member

@bosilca bosilca commented Mar 2, 2016

The current code to handle request completions (for both single- and multi-request calls) is complex and highly inefficient. It has a big impact on our message rate for single-threaded applications, and prevents good performance in multi-threaded cases (a global lock for completion, waking all threads upon each request completion, and many more issues).

This patch fixes all of these cases. First, it refactors the completion code and the link between completion and opal_progress. It introduces a new synchronization primitive: the completion status of the request itself (COMPLETE, PENDING, or a synchronization primitive). As a result, the request status is now handled via atomic operations: at least one per request (to mark it as completed) and at most two (one to set the synchronization primitive and one to mark the request as complete). Second, it refactors all wait* and test* functions to reduce the number of iterations over the array of requests (if any).
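
To make the scheme concrete, here is a minimal C sketch of the idea (illustration only, not the Open MPI implementation; sketch_request_t, sketch_sync_t, SKETCH_PENDING, and SKETCH_COMPLETED are hypothetical names). The completion field of a request atomically holds either PENDING, COMPLETED, or a pointer to the waiter's synchronization object, so completing a request costs one atomic exchange and waiting adds at most one more atomic to install the sync object:

#include <pthread.h>
#include <stdatomic.h>

#define SKETCH_PENDING   ((void *) 0)
#define SKETCH_COMPLETED ((void *) 1)

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             missing;            /* completions this waiter still needs */
} sketch_sync_t;

typedef struct {
    /* SKETCH_PENDING, SKETCH_COMPLETED, or a sketch_sync_t pointer */
    _Atomic(void *) completion;
} sketch_request_t;

/* Completion side: one atomic exchange marks the request complete; only a
 * waiter that registered a sync object on this request gets signalled. */
static void sketch_request_complete(sketch_request_t *req)
{
    void *prev = atomic_exchange(&req->completion, SKETCH_COMPLETED);
    if (prev != SKETCH_PENDING) {
        sketch_sync_t *sync = (sketch_sync_t *) prev;
        pthread_mutex_lock(&sync->lock);
        if (--sync->missing == 0) {
            pthread_cond_signal(&sync->cond);
        }
        pthread_mutex_unlock(&sync->lock);
    }
}

/* Wait side: one compare-and-swap installs the sync object; if the request
 * already completed, return without ever touching a lock. */
static void sketch_request_wait(sketch_request_t *req)
{
    sketch_sync_t sync;
    pthread_mutex_init(&sync.lock, NULL);
    pthread_cond_init(&sync.cond, NULL);
    sync.missing = 1;

    void *expected = SKETCH_PENDING;
    if (atomic_compare_exchange_strong(&req->completion, &expected,
                                       (void *) &sync)) {
        pthread_mutex_lock(&sync.lock);
        while (sync.missing > 0) {
            pthread_cond_wait(&sync.cond, &sync.lock);
        }
        pthread_mutex_unlock(&sync.lock);
    }
    pthread_cond_destroy(&sync.cond);
    pthread_mutex_destroy(&sync.lock);
}

A thread that finds the request already completed never blocks and never takes a lock, which keeps the single-threaded path cheap; a blocked thread is woken only when one of its own requests completes.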

@bosilca bosilca added this to the v2.0.0 milestone Mar 2, 2016
@@ -92,7 +90,6 @@ static int mca_pml_ob1_recv_request_free(struct ompi_request_t** request)
MCA_PML_OB1_RECV_REQUEST_RETURN( recvreq );
}

Member

why are the calls to OPAL_THREAD_LOCK removed?

Member Author

Because there is no locking anymore. Everything related to the synchronization is now handled down in the sync_wait, via atomic operations.
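
As a hypothetical illustration of that answer (made-up names, not the Open MPI source): the old pattern needed a lock and a broadcast around the completion flag, waking every waiting thread, whereas the new pattern marks completion with a single atomic exchange and leaves any wake-up to the per-waiter sync object sketched above.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Before: a shared lock serializes completion and a broadcast wakes every
 * waiting thread, whether or not its request completed. */
static pthread_mutex_t completion_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  completion_cond = PTHREAD_COND_INITIALIZER;

static void complete_locked(volatile bool *req_complete)
{
    pthread_mutex_lock(&completion_lock);
    *req_complete = true;
    pthread_cond_broadcast(&completion_cond);
    pthread_mutex_unlock(&completion_lock);
}

/* After: one atomic exchange marks the request complete and hands back any
 * sync object a waiter may have registered, so the free/complete paths no
 * longer need an explicit lock around the request state. */
#define DONE ((void *) 1)

static void *complete_lock_free(_Atomic(void *) *completion)
{
    return atomic_exchange(completion, DONE);   /* returns a registered sync, if any */
}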

@hppritcha
Member

bot:retest

@hppritcha
Member

@jsquyres I think this is too risky for 2.0.0

@hppritcha
Member

@bosilca please have your student rebase down to a few commits.

@bosilca
Member Author

bosilca commented Mar 3, 2016

Performance tests show almost a 400% message rate increase over vader with 4 threads per process. Increasing the number of threads only increases this performance boost.

@jsquyres
Member

jsquyres commented Mar 3, 2016

@hppritcha I agree that this is too big for v2.0.0 -- I think v2.0.1 at the earliest.

@bosilca What's the merge commit for in the middle? That does not seem correct at all.

Also, it would be really good to clean up this patch set a bunch -- there's a whole lot of intermediate commits (e.g., "foo" is added in commit A, but then removed in commit B). Can this be squashed to a smaller number of logical commits? (it probably doesn't make sense to squash it down to 1 commit, but something less than 30 would be good...)

@jsquyres jsquyres modified the milestones: v2.1.0, v2.0.0 Mar 8, 2016
@jsquyres
Member

jsquyres commented Mar 8, 2016

@bosilca tells me that this PR is not yet ready -- they're still working on the code. For the moment, I'm deferring this to v2.1.0.

@jsquyres
Member

@bosilca Does this PR represent all the commits that went into master? The merge commit near the bottom of the list seems weird / wrong...

Thananon Patinyasakdikul and others added 2 commits May 19, 2016 16:41
Added the wait sync primitive and integrate it into the PML and MTL
infrastructure. The multi-threaded requests are now significantly
less heavy and less noisy (only the threads associated with completed
requests are signalled).
@jsquyres
Member

bot:retest

@jsquyres
Member

@Di0gen @jladd-mlnx UCX builds still seem to be broken in the Mellanox Jenkins.

@jsquyres
Member

Per discussion on the 2016-05-24 webex, we decided:

  • This is a blocker performance bug. We need it for v2.0.0. We/the OMPI community acknowledge that this will further delay v2.0.0.
  • Let's merge this to master and see how it does.

@jsquyres
Member

@bosilca After the webex conversation, however, I see that the Mellanox Jenkins failed an OSHMEM test with this PR. It looks like a legit failure -- can you please investigate?

@bosilca
Member Author

bosilca commented May 24, 2016

@jsquyres looking at the jenkins failure indicates the issue is in the RTE, more precisely in the OOB. Here is the only usable stack I can get from the Mellanox output.

10:09:19 [jenkins01:02266] *** Process received signal ***
10:09:19 [jenkins01:02266] Signal: Segmentation fault (11)
10:09:19 [jenkins01:02266] Signal code:  (0)
10:09:19 [jenkins01:02266] Failing at address: 0xb20000008d9
10:09:19 oshrun: Forwarding signal 18 to job
10:09:19 [jenkins01:02266] [ 0] /lib64/libpthread.so.0[0x3d6980f710]
10:09:19 [jenkins01:02266] [ 1] /lib64/libc.so.6(__select+0x33)[0x3d690e15e3]
10:09:19 [jenkins01:02266] [ 2] /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/lib/openmpi/mca_oob_tcp.so(+0xc1ca)[0x7ffff55411ca]
10:09:19 [jenkins01:02266] [ 3] /lib64/libpthread.so.0[0x3d698079d1]
10:09:19 [jenkins01:02266] [ 4] /lib64/libc.so.6(clone+0x6d)[0x3d690e8b6d]

@rhc54
Contributor

rhc54 commented May 25, 2016

Something is very off here, guys - look at the stack trace. The error is in oshrun, not in an application process. Unless Mellanox has done something very odd, there is no way that a change in the MPI request can cause oshrun to fail.

@hjelmn
Member

hjelmn commented May 25, 2016

@rhc54 The question is whether the application processes have started or not. They certainly haven't produced any output.

@hjelmn
Member

hjelmn commented May 25, 2016

@rhc54 BTW, oshrun crashed because the mellanox jenkins sends a SEGV on timeout.

@jsquyres
Member

Oh, wow -- we've been chasing a SEGV, but the SEGV was coming from Mellanox's script itself (note the parameter to the "timeout" command):

02:59:18 + taskset -c 16,17 timeout -s SIGSEGV 10m /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/bin/oshrun -np 8 --bind-to core -x SHMEM_SYMMETRIC_HEAP_SIZE=1024M --mca btl_openib_if_include mlx4_0:1 -x MXM_RDMA_PORTS=mlx4_0:1 -x UCX_NET_DEVICES=mlx4_0:1 -x UCX_TLS=rc,cm --mca rmaps_base_dist_hca mlx4_0:1 --mca sshmem_verbs_hca_name mlx4_0:1 --mca spml ikrit -mca pml cm -mca mtl mxm /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_oshmem
03:09:19 [jenkins01:02266] *** Process received signal ***
03:09:19 [jenkins01:02266] Signal: Segmentation fault (11)

So it looks like the real issue is a hang.

@jladd-mlnx
Member

@jsquyres #1716

@jsquyres
Member

@jladd-mlnx Yep -- fingers crossed! Can we change the timeout signal to something other than SEGV? That misled us for a while today; it might be helpful in the future to not have to remember that the SEGV isn't actually a SEGV. Thanks!

@mike-dubman
Member

@jsquyres - the reason we send SEGV is to make mpirun dump the stacktrace.
Is there any way to make it happen w/o sending SEGV? Maybe some mca param to dump the stack after X min?

@hjelmn
Member

hjelmn commented May 26, 2016

SIGABRT?

@mike-dubman
Member

SIGABRT is not interceptable by mpirun and will just kill it w/o a stacktrace.

@jsquyres
Member

@jladd-mlnx Would it be better to grab a stack trace with something like pstack? E.g., here's pstack output from the PID of my emacs:

$ pstack 28231
Thread 2 (Thread 0x2aaab13aa700 (LWP 28232)):
#0  0x0000003491e0e75d in read () from /lib64/libpthread.so.0
#1  0x0000003492e4110b in ?? () from /lib64/libglib-2.0.so.0
#2  0x0000003492e691b4 in ?? () from /lib64/libglib-2.0.so.0
#3  0x0000003491e079d1 in start_thread () from /lib64/libpthread.so.0
#4  0x00000034916e8b5d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2aaaaaee2b60 (LWP 28231)):
#0  0x00000034916e17de in pselect () from /lib64/libc.so.6
#1  0x00000000005b2ce2 in xg_select ()
#2  0x0000000000582d33 in wait_reading_process_output ()
#3  0x00000000004dbd7f in read_decoded_event_from_main_queue.isra.39 ()
#4  0x00000000004dff37 in read_char ()
#5  0x00000000004e0d0f in read_key_sequence.constprop.41 ()
#6  0x00000000004e2900 in command_loop_1 ()
#7  0x0000000000543a1e in internal_condition_case ()
#8  0x00000000004d4f6e in command_loop_2 ()
#9  0x000000000054392b in internal_catch ()
#10 0x00000000004d9497 in recursive_edit_1 ()
#11 0x00000000004d97ad in Frecursive_edit ()
#12 0x00000000004d42a5 in main ()

@jsquyres
Member

@jladd-mlnx Oops -- I meant to tag the prior comment to both @jladd-mlnx and @miked-mellanox. I'm apparently still not awake yet. Sorry!

@mike-dubman
Member

the problem is how to detect a hang. mpirun has a progress thread which can dump the stack after a user-provided time limit. With your proposal, one would need to develop a watchdog client and integrate it - it is very error prone IMO.

@jladd-mlnx
Member

We currently have no plans to change the script.

@jladd-mlnx
Member

For the public record, there were several inaccurate and misleading comments posted to this ticket. OSHMEM has nothing to do with this failure. A hang in the CM PML was triggered during an OSHMEM hello world test because OSHMEM init uses MPI collectives and, hence, the PML to bootstrap itself once MPI is init-ed. The root cause of the hang was a lack of support in the CM PML for the new request handling. The combination of OSHMEM with the CM PML, MXM MTL, and ikrit SPML just happened to trigger the hang. A fix for this issue was made in #1716; however, subsequent testing has exposed another bug in the OpenIB BTL that needs to be fixed as well.

Thanks to Mellanox Jenkins for catching this.

@rhc54
Contributor

rhc54 commented May 26, 2016

Just a little further clarification: part of the reason for the confusion was the use of an induced segfault in the Mellanox Jenkins script to initiate a timeout shutdown of oshrun. This provides a false positive in a portion of the code totally unrelated to the submitted PR, leaving the impression that the test, and not the code, was the root cause.

It appears we are unable to reach agreement on how best to avoid this going forward. It therefore remains a reasonable possibility that this problem will repeat in the future, pending agreement on a better solution.

@jladd-mlnx
Member

jladd-mlnx commented May 26, 2016

The resolution going forward is the following: any community member is more than welcome to access the Mellanox Jenkins server and is encouraged to do so. All I need is your public RSA key, and I will be happy to add it to the list of authorized keys. Once added, you will simply need to:

$ssh [email protected]

And then to the Jenkins server:

$ssh jenkins01

This takes Mellanox out of the critical path of other organizations' projects and road maps.

@rhc54
Contributor

rhc54 commented May 26, 2016

Good to know, but not likely to be useful, to be honest. Many of us are in the same situation as you and don't have the time to chase apparently unrelated issues, and there are legal concerns for some members regarding accessing Mellanox machines. So I don't think that will provide a practical solution.

Probably best for the community to discuss this offline.

@jladd-mlnx
Member

The offer stands. It's the best we can do.

@nysal
Member

nysal commented May 27, 2016

@miked-mellanox Probably using a tool like PADB (http://padb.pittman.org.uk/) to collect the stack trace after waiting for a timeout period is one option. I have used the code from https://code.google.com/archive/p/padb/source and that seems to work.

@mike-dubman
Member

yep. this tool does not fit the jenkins framework.

One would need to write complicated logic to detect an mpirun hang, fork padb to extract the stack, and properly communicate status back to jenkins.

@rhc54
Contributor

rhc54 commented May 27, 2016

Yeah, I'm not sure it's worth all that effort. The problem we really need to solve is to tell us that the test hung - getting a stack trace from mpirun won't tell us much about why, as mpirun isn't the source of the problem. I think a simple error statement that the test timed out, like we do for MTT, would be more useful and create less "false positive" confusion.

@mike-dubman
Member

mike-dubman commented May 27, 2016

yep, timeout -s SEGV mpirun ... can be replaced w/ timeout -s SIGKILL mpirun, and jenkins will report (in the console) that the test was killed. The next logical question from the community will be: where does it hang?
So, having mpirun print a stacktrace after some internal timeout (if provided) is probably the best automated way to satisfy both the jenkins report and developers' questions.

@rhc54
Contributor

rhc54 commented May 27, 2016

Yeah, we should think about this a bit. What we really want to know is that we timed out because the applications never terminated, as opposed to timing out because of a problem in mpirun itself. The problem with just generating an mpirun stack trace is that it won't tell you anything if the problem is in the application. All it can show is that mpirun is merrily waiting for something to happen.

Perhaps the better solution is to provide some way of asking mpirun to output a state report - i.e., what does it think the current state of the job is? Does it think all procs terminated and it is just hung trying to exit? Or does it think procs are still running?

Another thing we could get is a simple "ps" output. Are there procs still running? Or have all the apps exited? This should be simple to get, and would provide more info than a stack trace.

If we can do some of that, it would be helpful to avoid the confusion.

@mike-dubman
Member

yep, something like that would be very helpful to all.

Achieving the same effect with jenkins and external watchdog processes would make things worse, because it is hard to orchestrate inside a single jenkins process.

% mpirun -dump-stack-from-all-ranks-and-exit-with-error-after-time 10m imb.exe

@rhc54
Contributor

rhc54 commented May 27, 2016

Might be doable - I can take a crack at it.

@jsquyres
Member

@rhc54 Perhaps this is something you could daisy chain off the $MPIEXEC_TIMEOUT functionality...?

@rhc54
Contributor

rhc54 commented May 27, 2016

😆 that is exactly what I am coding right this second

@rhc54
Contributor

rhc54 commented May 27, 2016

What do you think of this:

$ mpirun -n 4 --map-by node -timeout 2 --report-state-on-timeout ./mpi_spin
--------------------------------------------------------------------------
The user-provided time limit for job execution has been
reached:

  MPIEXEC_TIMEOUT: NULL seconds

The job will now be aborted. Please check your code and/or
adjust/remove the job execution time limit (as specified
by MPIEXEC_TIMEOUT in your environment or -timeout on the cmd line).
--------------------------------------------------------------------------
DATA FOR JOB: [42410,0]
    Num apps: 1 Num procs: 2    JobState: ALL DAEMONS REPORTED  Abort: False
    Num launched: 0 Num reported: 2 Num terminated: 0

    Procs:
        Rank: 0 Node: bend001   PID: 10484  State: RUNNING  ExitCode 0
        Rank: 1 Node: bend003   PID: 10489  State: RUNNING  ExitCode 0

DATA FOR JOB: [42410,1]
    Num apps: 1 Num procs: 4    JobState: SYNC REGISTERED   Abort: False
    Num launched: 4 Num reported: 4 Num terminated: 0

    Procs:
        Rank: 0 Node: bend001   PID: 10490  State: SYNC REGISTERED  ExitCode 0
        Rank: 1 Node: bend003   PID: 28765  State: SYNC REGISTERED  ExitCode 0
        Rank: 2 Node: bend001   PID: 10491  State: SYNC REGISTERED  ExitCode 0
        Rank: 3 Node: bend003   PID: 28766  State: SYNC REGISTERED  ExitCode 0

jsquyres pushed a commit to jsquyres/ompi that referenced this pull request Aug 23, 2016
* Remodel the request.
Added the wait sync primitive and integrate it into the PML and MTL
infrastructure. The multi-threaded requests are now significantly
less heavy and less noisy (only the threads associated with completed
requests are signaled).

* Fix the condition to release the request.
(cherry picked from open-mpi/ompi@b90c838)

Signed-off-by: Nathan Hjelm <[email protected]>
bosilca added a commit to bosilca/ompi that referenced this pull request Oct 3, 2016
* Remodel the request.
Added the wait sync primitive and integrate it into the PML and MTL
infrastructure. The multi-threaded requests are now significantly
less heavy and less noisy (only the threads associated with completed
requests are signaled).

* Fix the condition to release the request.