Refactor the request completion #1422

Merged
merged 3 commits into from May 24, 2016

Conversation

bosilca
Member

@bosilca bosilca commented Mar 2, 2016

The current code to handle request completions (for both single- and multi-request calls) is complex and highly inefficient. It has a big impact on our message rate for single-threaded applications, and prevents good performance in multi-threaded cases (a global lock for completion, waking all threads upon each request completion, and many more issues).

This patch fixes all of these cases. First, it refactors the completion code and the link between completion and opal_progress. It introduces a new synchronization primitive: the completion status of the request itself (COMPLETE, PENDING, or a synchronization primitive). As a result, the request status is now handled via atomic operations: at least one per request (to mark it as completed) and at most two (one to set the synchronization primitive and one to mark the request as complete). Second, it refactors all wait* and test* functions to reduce the number of iterations over the array of requests (if any).
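
To make the scheme concrete, here is a minimal C sketch of the idea (illustration only, not the Open MPI implementation; sketch_request_t, sketch_sync_t, SKETCH_PENDING, and SKETCH_COMPLETED are hypothetical names). The completion field of a request atomically holds either PENDING, COMPLETED, or a pointer to the waiter's synchronization object, so completing a request costs one atomic exchange and waiting adds at most one more atomic to install the sync object:

#include <pthread.h>
#include <stdatomic.h>

#define SKETCH_PENDING   ((void *) 0)
#define SKETCH_COMPLETED ((void *) 1)

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             missing;            /* completions this waiter still needs */
} sketch_sync_t;

typedef struct {
    /* SKETCH_PENDING, SKETCH_COMPLETED, or a sketch_sync_t pointer */
    _Atomic(void *) completion;
} sketch_request_t;

/* Completion side: one atomic exchange marks the request complete; only a
 * waiter that registered a sync object on this request gets signalled. */
static void sketch_request_complete(sketch_request_t *req)
{
    void *prev = atomic_exchange(&req->completion, SKETCH_COMPLETED);
    if (prev != SKETCH_PENDING) {
        sketch_sync_t *sync = (sketch_sync_t *) prev;
        pthread_mutex_lock(&sync->lock);
        if (--sync->missing == 0) {
            pthread_cond_signal(&sync->cond);
        }
        pthread_mutex_unlock(&sync->lock);
    }
}

/* Wait side: one compare-and-swap installs the sync object; if the request
 * already completed, return without ever touching a lock. */
static void sketch_request_wait(sketch_request_t *req)
{
    sketch_sync_t sync;
    pthread_mutex_init(&sync.lock, NULL);
    pthread_cond_init(&sync.cond, NULL);
    sync.missing = 1;

    void *expected = SKETCH_PENDING;
    if (atomic_compare_exchange_strong(&req->completion, &expected,
                                       (void *) &sync)) {
        pthread_mutex_lock(&sync.lock);
        while (sync.missing > 0) {
            pthread_cond_wait(&sync.cond, &sync.lock);
        }
        pthread_mutex_unlock(&sync.lock);
    }
    pthread_cond_destroy(&sync.cond);
    pthread_mutex_destroy(&sync.lock);
}

A thread that finds the request already completed never blocks and never takes a lock, which keeps the single-threaded path cheap; a blocked thread is woken only when one of its own requests completes.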

@bosilca bosilca added this to the v2.0.0 milestone Mar 2, 2016
@@ -92,7 +90,6 @@ static int mca_pml_ob1_recv_request_free(struct ompi_request_t** request)
MCA_PML_OB1_RECV_REQUEST_RETURN( recvreq );
}

Member

why are the calls to OPAL_THREAD_LOCK removed?

Member Author

Because there is no locking anymore. Everything related to the synchronization is now handled down in the sync_wait, via atomic operations.
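
As a hypothetical illustration of that answer (made-up names, not the Open MPI source): the old pattern needed a lock and a broadcast around the completion flag, waking every waiting thread, whereas the new pattern marks completion with a single atomic exchange and leaves any wake-up to the per-waiter sync object sketched above.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Before: a shared lock serializes completion and a broadcast wakes every
 * waiting thread, whether or not its request completed. */
static pthread_mutex_t completion_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  completion_cond = PTHREAD_COND_INITIALIZER;

static void complete_locked(volatile bool *req_complete)
{
    pthread_mutex_lock(&completion_lock);
    *req_complete = true;
    pthread_cond_broadcast(&completion_cond);
    pthread_mutex_unlock(&completion_lock);
}

/* After: one atomic exchange marks the request complete and hands back any
 * sync object a waiter may have registered, so the free/complete paths no
 * longer need an explicit lock around the request state. */
#define DONE ((void *) 1)

static void *complete_lock_free(_Atomic(void *) *completion)
{
    return atomic_exchange(completion, DONE);   /* returns a registered sync, if any */
}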

@hppritcha
Member

bot:retest

@hppritcha
Member

@jsquyres I think this is too risky for 2.0.0

@hppritcha
Member

@bosilca please have your student rebase down to a few commits.

@bosilca
Member Author

bosilca commented Mar 3, 2016

Performance tests show almost a 400% message rate increase over vader with 4 threads per process. Increasing the number of threads only increases this performance boost.

@jsquyres
Member

jsquyres commented Mar 3, 2016

@hppritcha I agree that this is too big for v2.0.0 -- I think v2.0.1 at the earliest.

@bosilca What's the merge commit for in the middle? That does not seem correct at all.

Also, it would be really good to clean up this patch set a bunch -- there's a whole lot of intermediate commits (e.g., "foo" is added in commit A, but then removed in commit B). Can this be squashed to a smaller number of logical commits? (it probably doesn't make sense to squash it down to 1 commit, but something less than 30 would be good...)

@jsquyres jsquyres modified the milestones: v2.1.0, v2.0.0 Mar 8, 2016
@jsquyres
Member

jsquyres commented Mar 8, 2016

@bosilca tells me that this PR is not yet ready -- they're still working on the code. For the moment, I'm deferring this to v2.1.0.

@jsquyres
Member

@bosilca Does this PR represent all the commits that went into master? The merge commit near the bottom of the list seems weird / wrong...

Thananon Patinyasakdikul and others added 2 commits May 19, 2016 16:41
Added the wait sync primitive and integrate it into the PML and MTL
infrastructure. The multi-threaded requests are now significantly
less heavy and less noisy (only the threads associated with completed
requests are signalled).
@jsquyres
Member

bot:retest

@jsquyres
Member

@Di0gen @jladd-mlnx UCX builds still seem to be broken in the Mellanox Jenkins.

@jsquyres
Member

Per discussion on the 2016-05-24 webex, we decided:

  • This is a blocker performance bug. We need it for v2.0.0. We/the OMPI community acknowledge that this will further delay v2.0.0.
  • Let's merge this to master and see how it does.

@jsquyres
Member

@bosilca After the webex conversation, however, I see that the Mellanox Jenkins failed an OSHMEM test with this PR. It looks like a legit failure -- can you please investigate?

@bosilca
Member Author

bosilca commented May 24, 2016

@jsquyres looking at the jenkins failure indicates the issue is in the RTE, more precisely in the OOB. Here is the only usable stack I can get from the Mellanox output.

10:09:19 [jenkins01:02266] *** Process received signal ***
10:09:19 [jenkins01:02266] Signal: Segmentation fault (11)
10:09:19 [jenkins01:02266] Signal code:  (0)
10:09:19 [jenkins01:02266] Failing at address: 0xb20000008d9
10:09:19 oshrun: Forwarding signal 18 to job
10:09:19 [jenkins01:02266] [ 0] /lib64/libpthread.so.0[0x3d6980f710]
10:09:19 [jenkins01:02266] [ 1] /lib64/libc.so.6(__select+0x33)[0x3d690e15e3]
10:09:19 [jenkins01:02266] [ 2] /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/lib/openmpi/mca_oob_tcp.so(+0xc1ca)[0x7ffff55411ca]
10:09:19 [jenkins01:02266] [ 3] /lib64/libpthread.so.0[0x3d698079d1]
10:09:19 [jenkins01:02266] [ 4] /lib64/libc.so.6(clone+0x6d)[0x3d690e8b6d]

@rhc54
Contributor

rhc54 commented May 25, 2016

Something is very off here, guys - look at the stack trace. The error is in oshrun, not in an application process. Unless Mellanox has done something very odd, there is no way that a change in the MPI request can cause oshrun to fail.

@hjelmn
Member

hjelmn commented May 25, 2016

@rhc54 The question is whether the application processes have started or not. They certainly haven't produced any output.

@hjelmn
Member

hjelmn commented May 25, 2016

@rhc54 BTW, oshrun crashed because the mellanox jenkins sends a SEGV on timeout.

@jsquyres
Member

Oh, wow -- we've been chasing a SEGV, but the SEGV was coming from Mellanox's script itself (note the parameter to the "timeout" command):

02:59:18 + taskset -c 16,17 timeout -s SIGSEGV 10m /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/bin/oshrun -np 8 --bind-to core -x SHMEM_SYMMETRIC_HEAP_SIZE=1024M --mca btl_openib_if_include mlx4_0:1 -x MXM_RDMA_PORTS=mlx4_0:1 -x UCX_NET_DEVICES=mlx4_0:1 -x UCX_TLS=rc,cm --mca rmaps_base_dist_hca mlx4_0:1 --mca sshmem_verbs_hca_name mlx4_0:1 --mca spml ikrit -mca pml cm -mca mtl mxm /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_oshmem
03:09:19 [jenkins01:02266] *** Process received signal ***
03:09:19 [jenkins01:02266] Signal: Segmentation fault (11)

So it looks like the real issue is a hang.

@jladd-mlnx
Member

@jsquyres #1716

@jsquyres
Member

@jladd-mlnx Yep -- fingers crossed! Can we change the timeout signal to something other than SEGV? That misled us for a while today; it might be helpful in the future to not have to remember that the SEGV isn't actually a SEGV. Thanks!

@mike-dubman
Member

@jsquyres - the reason we send SEGV is to make mpirun dump the stacktrace.
Is there any way to make it happen w/o sending SEGV? Maybe some mca param to dump the stack after X min?

@hjelmn
Member

hjelmn commented May 26, 2016

SIGABRT?

@mike-dubman
Member

SIGABRT is not interceptable by mpirun and will just kill it w/o a stacktrace.

@jsquyres
Member

@jladd-mlnx Would it be better to grab a stack trace with something like pstack? E.g., here's pstack output from the PID of my emacs:

$ pstack 28231
Thread 2 (Thread 0x2aaab13aa700 (LWP 28232)):
#0  0x0000003491e0e75d in read () from /lib64/libpthread.so.0
#1  0x0000003492e4110b in ?? () from /lib64/libglib-2.0.so.0
#2  0x0000003492e691b4 in ?? () from /lib64/libglib-2.0.so.0
#3  0x0000003491e079d1 in start_thread () from /lib64/libpthread.so.0
#4  0x00000034916e8b5d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2aaaaaee2b60 (LWP 28231)):
#0  0x00000034916e17de in pselect () from /lib64/libc.so.6
#1  0x00000000005b2ce2 in xg_select ()
#2  0x0000000000582d33 in wait_reading_process_output ()
#3  0x00000000004dbd7f in read_decoded_event_from_main_queue.isra.39 ()
#4  0x00000000004dff37 in read_char ()
#5  0x00000000004e0d0f in read_key_sequence.constprop.41 ()
#6  0x00000000004e2900 in command_loop_1 ()
#7  0x0000000000543a1e in internal_condition_case ()
#8  0x00000000004d4f6e in command_loop_2 ()
#9  0x000000000054392b in internal_catch ()
#10 0x00000000004d9497 in recursive_edit_1 ()
#11 0x00000000004d97ad in Frecursive_edit ()
#12 0x00000000004d42a5 in main ()

@jsquyres
Member

@jladd-mlnx Oops -- I meant to tag the prior comment to both @jladd-mlnx and @miked-mellanox. I'm apparently still not awake yet. Sorry!

@mike-dubman
Member

the problem is how to detect a hang. mpirun has a progress thread which can dump the stack after a user-provided time limit. With your proposal, one would need to develop a watchdog client and integrate it - it is very error prone IMO.

@jladd-mlnx
Member

We currently have no plans to change the script.

@jladd-mlnx
Member

For the public record, there were several inaccurate and misleading comments posted to this ticket. OSHMEM has nothing to do with this failure. A hang in the CM PML was triggered during an OSHMEM hello world test because OSHMEM init uses MPI collectives and, hence, the PML to bootstrap itself once MPI is init-ed. The root cause of the hang was a lack of support in the CM PML for the new request handling. The combination of OSHMEM with the CM PML, MXM MTL, and ikrit SPML just happened to trigger the hang. A fix for this issue was made in #1716; however, subsequent testing has exposed another bug in the OpenIB BTL that needs to be fixed as well.

Thanks to Mellanox Jenkins for catching this.

@rhc54
Contributor

rhc54 commented May 26, 2016

Just a little further clarification: part of the reason for the confusion was the use of an induced segfault in the Mellanox Jenkins script to initiate a timeout shutdown of oshrun. This provides a false positive in a portion of the code totally unrelated to the submitted PR, leaving the impression that the test, and not the code, was the root cause.

It appears we are unable to reach agreement on how best to avoid this going forward. It therefore remains a reasonable possibility that this problem will repeat in the future, pending agreement on a better solution.

@jladd-mlnx
Member

jladd-mlnx commented May 26, 2016

The resolution going forward is the following: any community member is more than welcome to access the Mellanox Jenkins server and is encouraged to do so. All I need is your public RSA key, and I will be happy to add it to the list of authorized keys. Once added, you will simply need to:

$ssh [email protected]

And then to the Jenkins server:

$ssh jenkins01

This takes Mellanox out of the critical path of other organizations' projects and road maps.

@rhc54
Contributor

rhc54 commented May 26, 2016

Good to know, but not likely to be useful, to be honest. Many of us are in the same situation as you and don't have the time to chase apparently unrelated issues, and there are legal concerns for some members regarding accessing Mellanox machines. So I don't think that will provide a practical solution.

Probably best for the community to discuss this offline.

@jladd-mlnx
Member

The offer stands. It's the best we can do.

@nysal
Member

nysal commented May 27, 2016

@miked-mellanox Probably using a tool like PADB (http://padb.pittman.org.uk/) to collect the stack trace after waiting for a timeout period is one option. I have used the code from https://code.google.com/archive/p/padb/source and that seems to work.

@mike-dubman
Member

yep. this tool does not fit the jenkins framework.

One would need to write complicated logic to detect an mpirun hang, fork padb to extract the stack, and properly communicate status back to jenkins.

@rhc54
Contributor

rhc54 commented May 27, 2016

Yeah, I'm not sure it's worth all that effort. The problem we really need to solve is to tell us that the test hung - getting a stack trace from mpirun won't tell us much about why, as mpirun isn't the source of the problem. I think a simple error statement that the test timed out, like we do for MTT, would be more useful and create less "false positive" confusion.

@mike-dubman
Member

mike-dubman commented May 27, 2016

yep, timeout -s SEGV mpirun ... can be replaced w/ timeout -s SIGKILL mpirun, and jenkins will report (in the console) that the test was killed. The next logical question from the community will be: where does it hang?
So, having mpirun print a stacktrace after some internal timeout (if provided) is probably the best automated way to satisfy both the jenkins report and developers' questions.

@rhc54
Contributor

rhc54 commented May 27, 2016

Yeah, we should think about this a bit. What we really want to know is that we timed out because the applications never terminated, as opposed to timing out because of a problem in mpirun itself. The problem with just generating an mpirun stack trace is that it won't tell you anything if the problem is in the application. All it can show is that mpirun is merrily waiting for something to happen.

Perhaps the better solution is to provide some way of asking mpirun to output a state report - i.e., what does it think the current state of the job is? Does it think all procs terminated and it is just hung trying to exit? Or does it think procs are still running?

Another thing we could get is a simple "ps" output. Are there procs still running? Or have all the apps exited? This should be simple to get, and would provide more info than a stack trace.

If we can do some of that, it would be helpful to avoid the confusion.

@mike-dubman
Member

yep, something like that would be very helpful to all.

Achieving the same effect with jenkins and external watchdog processes would make things worse, because it is hard to orchestrate inside a single jenkins process.

% mpirun -dump-stack-from-all-ranks-and-exit-with-error-after-time 10m imb.exe

@rhc54
Contributor

rhc54 commented May 27, 2016

Might be doable - I can take a crack at it.

@jsquyres
Member

@rhc54 Perhaps this is something you could daisy chain off the $MPIEXEC_TIMEOUT functionality...?

@rhc54
Contributor

rhc54 commented May 27, 2016

😆 that is exactly what I am coding right this second

@rhc54
Contributor

rhc54 commented May 27, 2016

What do you think of this:

$ mpirun -n 4 --map-by node -timeout 2 --report-state-on-timeout ./mpi_spin
--------------------------------------------------------------------------
The user-provided time limit for job execution has been
reached:

  MPIEXEC_TIMEOUT: NULL seconds

The job will now be aborted. Please check your code and/or
adjust/remove the job execution time limit (as specified
by MPIEXEC_TIMEOUT in your environment or -timeout on the cmd line).
--------------------------------------------------------------------------
DATA FOR JOB: [42410,0]
    Num apps: 1 Num procs: 2    JobState: ALL DAEMONS REPORTED  Abort: False
    Num launched: 0 Num reported: 2 Num terminated: 0

    Procs:
        Rank: 0 Node: bend001   PID: 10484  State: RUNNING  ExitCode 0
        Rank: 1 Node: bend003   PID: 10489  State: RUNNING  ExitCode 0

DATA FOR JOB: [42410,1]
    Num apps: 1 Num procs: 4    JobState: SYNC REGISTERED   Abort: False
    Num launched: 4 Num reported: 4 Num terminated: 0

    Procs:
        Rank: 0 Node: bend001   PID: 10490  State: SYNC REGISTERED  ExitCode 0
        Rank: 1 Node: bend003   PID: 28765  State: SYNC REGISTERED  ExitCode 0
        Rank: 2 Node: bend001   PID: 10491  State: SYNC REGISTERED  ExitCode 0
        Rank: 3 Node: bend003   PID: 28766  State: SYNC REGISTERED  ExitCode 0

jsquyres pushed a commit to jsquyres/ompi that referenced this pull request Aug 23, 2016
* Remodel the request.
Added the wait sync primitive and integrate it into the PML and MTL
infrastructure. The multi-threaded requests are now significantly
less heavy and less noisy (only the threads associated with completed
requests are signaled).

* Fix the condition to release the request.
(cherry picked from open-mpi/ompi@b90c838)

Signed-off-by: Nathan Hjelm <[email protected]>
bosilca added a commit to bosilca/ompi that referenced this pull request Oct 3, 2016
* Remodel the request.
Added the wait sync primitive and integrate it into the PML and MTL
infrastructure. The multi-threaded requests are now significantly
less heavy and less noisy (only the threads associated with completed
requests are signaled).

* Fix the condition to release the request.