Topic/ompi-osc-ucx: Add ucx implementation for ompi osc #3690
Force-pushed 5ed18ba to a6199b1 (compare)
Adds a UCX OSC component. Preliminary data collected with ANL's MPI RMA message-rate benchmark, measured on dual-socket Sandy Bridge hosts with a Connect-IB FDR HCA, no switch:

- UCX OSC using the RC transport: ~12.5 million messages per second
- OpenIB BTL with RDMA OSC: ~8.8 million messages per second
bot:mellanox:retest
```c
typedef struct ompi_osc_ucx_component {
    ompi_osc_base_component_t super;
    ucp_context_h ucp_context;
```
ucp_worker contains a pointer to ucp_context. Is there a way to avoid keeping both?

there is no UCX API to get a worker's context, right?

oh, right - we do not expose that
```c
}

if (assert & MPI_MODE_NOPRECEDE) {
    ret = module->comm->c_coll->coll_barrier(module->comm,
```
Maybe it is better to flush only if !(assert & MPI_MODE_NOPRECEDE), to avoid duplicating the barrier call.

(now L35)
I do not understand. This branch returns immediately, and there is a flush after this branch, which should be the !(assert & MPI_MODE_NOPRECEDE) case, right?

I mean:

```c
if (!(assert & MPI_MODE_NOPRECEDE)) {
    /* flush here */
}
return barrier(..);
```

would be less code than:

```c
if (assert & MPI_MODE_NOPRECEDE) {
    return barrier(..);
}
/* flush here */
return barrier(..);
```
```c
if (module->lock_count == 0) {
    if (module->epoch_type.access != NONE_EPOCH &&
        module->epoch_type.access != FENCE_EPOCH) {
```
Lock is not supposed to be used in a FENCE active epoch.

(now L131)
I think this is the same problem as the first long discussion: when the user calls Fence with no assert, the MPI library cannot decide whether the epoch is closed or not. Once we have a consensus, I know how to do it here.
```c
    }
    assert(opal_list_is_empty(&module->outstanding_locks) == true);
} else {
    assert(module->epoch_type.access != PASSIVE_EPOCH);
```
Why should the access epoch not be passive?

(now L137)
Fixed. It should be a passive epoch.
```c
int ret = OMPI_SUCCESS;

if (module->epoch_type.access != NONE_EPOCH &&
    module->epoch_type.access != FENCE_EPOCH) {
```
lock_all should not be used with FENCE.

(now L236)
I think this is the same problem as the first long discussion: when the user calls Fence with no assert, the MPI library cannot decide whether the epoch is closed or not. Once we have a consensus, I know how to do it here.
ompi/mca/osc/ucx/osc_ucx_component.c (outdated)
```c
}

state_base = (void *)&(module->state);
ret = mem_map_and_exchange_win_info(&state_base, sizeof(ompi_osc_ucx_state_t),
```
Maybe it's better to collect state_info and win_info in one collective call somehow?

(now L483-L579)
fixed
ompi/mca/osc/ucx/osc_ucx_component.c (outdated)
```c
int i, ret = OMPI_SUCCESS;

if ((module->epoch_type.access != NONE_EPOCH &&
     module->epoch_type.access != FENCE_EPOCH) ||
```
Isn't it prohibited to free the window during an active access epoch? Also, the indentation is off.

(now L655)
I think this is the same problem as the first long discussion: when the user calls Fence with no assert, the MPI library cannot decide whether the epoch is closed or not, and the state should always be FENCE_EPOCH.
ompi/mca/osc/ucx/osc_ucx_component.c (outdated)
```c
assert(opal_list_is_empty(&module->outstanding_locks) == true);
OBJ_DESTRUCT(&module->outstanding_locks);

while (module->state.lock != TARGET_LOCK_UNLOCKED) {
```
Should this not be treated as an error?

(now L669)
I think it is not an error. For example:

```
rank0           rank1
win_lock
put/get
win_unlock
win_free        win_free
```

module->state.lock is the lock state on the target side (rank1). Rank1 should wait until this state changes to unlocked, which means rank0 has finished all one-sided operations, and then free the window.

ok, I see
```c
error:
    if (my_addr) ucp_worker_release_address(mca_osc_ucx_component.ucp_worker, my_addr);
    if (recv_buf) free(recv_buf);
```
Should the ucp_worker not be released in the error case when this is the only window?

(now L638)
fixed
ompi/mca/osc/ucx/osc_ucx_component.c (outdated)
```c
static int component_query(struct ompi_win_t *win, void **base, size_t size, int disp_unit,
                           struct ompi_communicator_t *comm, struct ompi_info_t *info, int flavor) {
    return OMPI_SUCCESS;
```
Should we not return some priority here?

(now L203)
I temporarily set the priority to 100 (osc/rdma's priority is 90). Is that OK, or should we use a different priority number?
@xinzhao3

```
[root@foam1 one-sided]# mpirun -np 2 -mca osc ucx -mca pml ucx osu_acc_latency
OSU MPI_Accumulate latency Test v5.3.2
```

Measured on a HUAWEI 2288 server, CX455A EDR HCA, FDR switch, RedHat 7.2, MLNX_OFED_LINUX-4.0-2.0.0.1, UCX master (4bdf8e8), Open MPI (https://github.com/xinzhao3/ompi/tree/topic/ompi-osc-ucx)
@xinzhao3 please update the thread once you finish debugging.
Force-pushed a8a882b to 2dafb59 (compare)
@rishards could you try to test again? Sorry, the accumulate code had not been updated.
```c
};

ompi_osc_ucx_module_t ompi_osc_ucx_module_template = {
    {
```
Use C99 designated initializers here.

fixed at L50
```c
if (OMPI_SUCCESS != ret) {
    goto error;
}
```
You can reduce the bcast/allreduce into a single allreduce. See osc/rdma.

fixed at L385
You should squash all of these commits.
```c
        (opal_free_list_item_t*) req); \
} while (0)

#endif /* OMPI_OSC_UCX_REQUEST_H */
```
Considering all of this code was first copied from another component, you MUST maintain the copyrights.

Fixed at L3
Let's try to get through this before the v4.0.0 branch.
Force-pushed 2dafb59 to 2900ad7 (compare)
@hjelmn @jladd-mlnx I addressed all comments and tested with OSU. Please review again.
```c
}

module->state.post_count = 0;
```
This is too simplistic. It is possible to get a post message from a different PSCW epoch. Say, for example, process A opens an access epoch with two different groups, one right after the other: start(group1), complete(group1), start(group2), complete(group2). In the first epoch it communicates with processes B and C, and in the second with D and E. If you simply use a counter, then a post from D or E can erroneously affect the epoch with B and C. Look at both osc/rdma and osc/pt2pt to see the mess we have to deal with. A counter is fine for complete. I think it suffices to say active target is a PIA.

@xinzhao3 If you don't hate active target when this is done I will be very surprised :).

@hjelmn I see what you mean. You are right, a simple post count is not correct. I am looking at the osc/rdma implementation now.

@hjelmn I modified win_post and win_start using the algorithm in osc/rdma, could you review again? (ucx_post is at L240, ucx_start is at L93)

@hjelmn are there any other issues?

@hjelmn @jladd-mlnx I tested with ompi-tests/onesided and ompi-tests/mpi_test_suite/one-sided. I am trying to trigger MTT to test my branch, so that we can run the entire ompi-tests suite.
Force-pushed 2900ad7 to 0bbe0cc (compare)
bot:mellanox:retest
Before I continue: did you run this with ompi-tests?
@hjelmn No, I just ran OSU tests and some of my own tests. How do I run ompi-tests? Is there a way to let Jenkins run it?
Force-pushed 0bbe0cc to f0871e5 (compare)
Force-pushed f0871e5 to 3e93679 (compare)
@jladd-mlnx I don't know if @xinzhao3 has access to the ompi-tests repository. If not, can you get her a copy? @xinzhao3 Out of the ompi-tests repo, run all the tests in the one-sided and ibm sub-directories. The one-sided tests are a simple
Signed-off-by: Xin Zhao <[email protected]>
Force-pushed 3e93679 to 2aa5292 (compare)
@hjelmn I have tested the code with ompi-tests. Could you review the code again?

Merging this now to get under our MTT.