Topic/ompi-osc-ucx: Add ucx implementation for ompi osc #3690

Merged: 1 commit merged into open-mpi:master on Jul 25, 2017

Conversation

@xinzhao3 (Contributor) commented Jun 9, 2017

No description provided.

@jladd-mlnx requested a review from yosefe June 9, 2017 19:54
@jladd-mlnx added this to the v3.1.0 milestone Jun 9, 2017
@xinzhao3 force-pushed the topic/ompi-osc-ucx branch 2 times, most recently from 5ed18ba to a6199b1, June 9, 2017 20:31
@jladd-mlnx (Member):
Adds a UCX OSC component.

Preliminary data collected with ANL's MPI RMA message rate benchmark.

Measured on dual-socket Sandy Bridge hosts, Connect-IB FDR HCA, no switch, MLNX_OFED_LINUX-3.4-2.1.9.0, ucx-master 1.3.3036 (a5e6e99), RedHat 7.2:

UCX OSC using RC transport: ~12.5 million messages per second.

tomislavj@vegas10 $ mpirun --display-map --map-by ppr:1:node -n 2 -mca pml yalla -mca osc ucx -x MXM_RDMA_PORTS=mlx5_0:1 -x MXM_TLS=rc -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc_mlx5,cm ./put.out

TimeTotal,IssueRate
0.000020657666,12392493.89
0.000020527281,12471208.52
0.000020442065,12523196.74
0.000020409934,12542911.56
0.000020424835,12533760.75
0.000020457897,12513504.97
0.000020429026,12531189.48
0.000020452775,12516638.90
0.000020465814,12508664.71
0.000020442065,12523196.74

OpenIB BTL and RDMA OSC: ~8.8 million messages per second

tomislavj@vegas10 $ mpirun -display-map --map-by ppr:1:node -n 2 -mca pml ob1 -mca btl self,openib -mca btl_openib_if_include mlx5_0:1 -mca osc rdma ./put.out

TimeTotal,IssueRate
0.000031112024,8228330.02
0.000029401999,8706890.92
0.000028848008,8874096.33
0.000028915005,8853534.76
0.000028624985,8943236.09
0.000028779992,8895068.45
0.000028768991,8898469.92
0.000028587005,8955118.01
0.000039062987,6553518.36
0.000028893992,8859973.43

@jladd-mlnx (Member):
bot:mellanox:retest


typedef struct ompi_osc_ucx_component {
    ompi_osc_base_component_t super;
    ucp_context_h ucp_context;
Member:
ucp_worker contains a pointer to ucp_context. Is there a way to avoid keeping both?

Contributor Author:
There is no UCX API to get the worker's context, right?

Member:
Oh, right, we do not expose that.

}

if (assert & MPI_MODE_NOPRECEDE) {
    ret = module->comm->c_coll->coll_barrier(module->comm,
Member:
It might be better to flush when !(assert & MPI_MODE_NOPRECEDE), to avoid duplicating the barrier call.

@xinzhao3 (Contributor, Author), Jul 11, 2017:
(now L35)
I do not understand. This branch returns immediately, and there is a flush after this branch, which covers the !(assert & NOPRECEDE) case, right?

Member:
I mean:

if (!(assert & MPI_MODE_NOPRECEDE)) {
    /* flush here */
}
return barrier(..);

would be less code than:

if (assert & MPI_MODE_NOPRECEDE) {
    return barrier(..);
}
/* flush here */
return barrier(..);


if (module->lock_count == 0) {
    if (module->epoch_type.access != NONE_EPOCH &&
        module->epoch_type.access != FENCE_EPOCH) {
Member:
Lock is not supposed to be used in a FENCE active epoch.

@xinzhao3 (Contributor, Author), Jul 11, 2017:
(now L131)
I think this is the same problem as in the first long discussion. When the user calls Fence with no assert, the MPI library cannot decide whether the epoch is closed or not. Once we have a consensus, I know how to handle it here.
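
To illustrate the ambiguity (a minimal MPI-level sketch of hypothetical user code, not code from this PR): a fence with no assert may both close the previous epoch and open a new one, so the library cannot treat it as a definite close.

/* Hypothetical user code showing why assert == 0 is ambiguous. */
MPI_Win_fence(0, win);                           /* may close the old epoch AND open a new one */
MPI_Put(buf, 1, MPI_INT, 1, 0, 1, MPI_INT, win); /* legal: an epoch may still be open */
MPI_Win_fence(MPI_MODE_NOSUCCEED, win);          /* only this assert guarantees no epoch follows */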

    }
    assert(opal_list_is_empty(&module->outstanding_locks) == true);
} else {
    assert(module->epoch_type.access != PASSIVE_EPOCH);
Member:
Why shouldn't the access epoch be passive?

Contributor Author:
(now L137)
Fixed. It should be a passive epoch.

int ret = OMPI_SUCCESS;

if (module->epoch_type.access != NONE_EPOCH &&
    module->epoch_type.access != FENCE_EPOCH) {
Member:
lock_all should not be used with FENCE.

@xinzhao3 (Contributor, Author), Jul 11, 2017:
(now L236)
I think this is the same problem as in the first long discussion. When the user calls Fence with no assert, the MPI library cannot decide whether the epoch is closed or not. Once we have a consensus, I know how to handle it here.

}

state_base = (void *)&(module->state);
ret = mem_map_and_exchange_win_info(&state_base, sizeof(ompi_osc_ucx_state_t),
Member:
Maybe it's better to collect state_info and win_info in one collective call somehow?

Contributor Author:
(now L483-L579)

Fixed.
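
For reference, the consolidation pattern being suggested, as a hedged MPI-level sketch (record sizes and variable names are hypothetical, not the merged code): pack both records into one buffer and exchange them with a single collective.

/* Hypothetical sketch: exchange win info and state info in one allgather. */
#define INFO_SIZE 64                               /* hypothetical fixed record size */
char packed[2 * INFO_SIZE];
char *all = malloc(comm_size * 2 * INFO_SIZE);
memcpy(packed, win_info, INFO_SIZE);               /* record previously exchanged alone */
memcpy(packed + INFO_SIZE, state_info, INFO_SIZE); /* record previously exchanged alone */
MPI_Allgather(packed, 2 * INFO_SIZE, MPI_BYTE,
              all, 2 * INFO_SIZE, MPI_BYTE, comm);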

int i, ret = OMPI_SUCCESS;

if ((module->epoch_type.access != NONE_EPOCH &&
     module->epoch_type.access != FENCE_EPOCH) ||
Member:
Isn't it prohibited to free the window during an active access epoch?
Also, the indentation is off.

@xinzhao3 (Contributor, Author), Jul 11, 2017:
(now L655)

I think this is the same problem as in the first long discussion. When the user calls Fence with no assert, the MPI library cannot decide whether the epoch is closed or not, and the state should always be FENCE_EPOCH.

assert(opal_list_is_empty(&module->outstanding_locks) == true);
OBJ_DESTRUCT(&module->outstanding_locks);

while (module->state.lock != TARGET_LOCK_UNLOCKED) {
Member:
Shouldn't this be treated as an error?

Contributor Author:
(now L669)
I think it is not an error. For example:

rank0              rank1
win_lock
put/get
win_unlock
win_free           win_free

module->state.lock is the lock state on the target side (rank1). Rank1 should wait until this state changes to unlocked, which means rank0 has finished all OSC operations, before freeing the window.

Member:
OK, I see.


error:
    if (my_addr) ucp_worker_release_address(mca_osc_ucx_component.ucp_worker, my_addr);
    if (recv_buf) free(recv_buf);
Member:
Shouldn't ucp_worker be released in the error case when this is the only window?

Contributor Author:
(now L638)

Fixed.


static int component_query(struct ompi_win_t *win, void **base, size_t size, int disp_unit,
                           struct ompi_communicator_t *comm, struct ompi_info_t *info, int flavor) {
    return OMPI_SUCCESS;
Member:
Shouldn't we return some priority here?

Contributor Author:
(now L203)
I temporarily set the priority to 100 (osc/rdma's priority is 90). Is that OK, or should we use a different priority number?
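
In the OSC framework, component_query reports the component's priority; a minimal sketch of the change under discussion (the value 100 is from the comment above, the body is otherwise illustrative):

static int component_query(struct ompi_win_t *win, void **base, size_t size, int disp_unit,
                           struct ompi_communicator_t *comm, struct ompi_info_t *info,
                           int flavor) {
    /* Higher than osc/rdma's 90, so osc/ucx wins the selection when usable. */
    return 100;
}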

@rishards commented Jul 8, 2017

@xinzhao3
Have you tested with the OSU Micro-Benchmarks? We hit a fault when testing with the one-sided MPI benchmarks.

[root@foam1 one-sided]# mpirun -np 2 -mca osc ucx -mca pml ucx osu_acc_latency

OSU MPI_Accumulate latency Test v5.3.2
Window creation: MPI_Win_allocate
Synchronization: MPI_Win_flush
Size Latency (us)
[foam1:29141] *** Process received signal ***
[foam1:29141] Signal: Aborted (6)
[foam1:29141] Signal code: (-6)
[foam1:29142] *** Process received signal ***
[foam1:29142] Signal: Aborted (6)
[foam1:29142] Signal code: (-6)

Measured on a HUAWEI 2288 server, CX455A EDR HCA, FDR switch, RedHat 7.2, MLNX_OFED_LINUX-4.0-2.0.0.1, UCX master (4bdf8e8), OMPI (https://github.com/xinzhao3/ompi/tree/topic/ompi-osc-ucx).
I wonder whether there is something wrong with my configuration.
For comparison, do you know where I could get ANL's MPI RMA message rate benchmark?

@jladd-mlnx (Member):

@xinzhao3 please update the thread once you finish debugging.

@xinzhao3 force-pushed the topic/ompi-osc-ucx branch 4 times, most recently from a8a882b to 2dafb59, July 11, 2017 23:31
@xinzhao3 (Contributor, Author):

@rishards could you test again? Sorry, the accumulate code had not been updated.

};

ompi_osc_ucx_module_t ompi_osc_ucx_module_template = {
    {
Member:
Use C99 designated initializers here.

Contributor Author:
Fixed at L50.
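
For reference, the shape of a C99 designated initializer for such a template (the callback field names below are illustrative, not necessarily those in the merged code):

ompi_osc_ucx_module_t ompi_osc_ucx_module_template = {
    .super = {
        /* hypothetical callbacks, shown only to illustrate the style */
        .osc_free  = ompi_osc_ucx_free,
        .osc_put   = ompi_osc_ucx_put,
        .osc_get   = ompi_osc_ucx_get,
        .osc_fence = ompi_osc_ucx_fence,
    },
};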

if (OMPI_SUCCESS != ret) {
    goto error;
}

Member:
You can reduce the bcast/allreduce into a single allreduce. See osc/rdma.

Contributor Author:
Fixed at L385.
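
The general idea, as a hedged MPI-level sketch (the two values are hypothetical stand-ins for whatever was previously broadcast and reduced separately; non-root ranks contribute 0 so MAX reproduces the broadcast):

/* Hypothetical sketch: one allreduce instead of a bcast plus an allreduce. */
uint64_t in[2] = { i_am_root ? root_value : 0,   /* bcast replaced by a MAX reduction */
                   local_flag };                 /* value previously allreduced */
uint64_t out[2];
MPI_Allreduce(in, out, 2, MPI_UINT64_T, MPI_MAX, comm);
/* out[0] replaces the broadcast value, out[1] the reduced flag. */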

@hjelmn (Member) commented Jul 12, 2017

You should squash all of these commits.

(opal_free_list_item_t*) req); \
} while (0)

#endif /* OMPI_OSC_UCX_REQUEST_H */
@hjelmn (Member), Jul 12, 2017:
Considering all of this code was first copied from another component, you MUST maintain the copyrights.

Contributor Author:
Fixed at L3.

@hjelmn (Member) commented Jul 12, 2017

Let's try to get through this before the v4.0.0 branch.

@jladd-mlnx (Member):
@hjelmn Thanks for your review. @xinzhao3 Please address Nathan's comments.

@xinzhao3 force-pushed the topic/ompi-osc-ucx branch from 2dafb59 to 2900ad7, July 13, 2017 16:25
@xinzhao3 (Contributor, Author):
@hjelmn @jladd-mlnx I addressed all comments and tested with OSU. Please review again.

}

module->state.post_count = 0;

Member:
This is too simplistic. It is possible to get a post message from a different PSCW epoch. Say, for example, process A opens an access epoch with two different groups, one right after the other (start(group1), complete(group1), start(group2), complete(group2)). In the first epoch it communicates with processes B and C, and in the second it communicates with D and E. If you simply use a counter, then a post from D or E can erroneously affect the epoch with B and C. Look at both osc/rdma and osc/pt2pt to see the mess we have to deal with. A counter is fine for complete. I think it suffices to say active target is a PIA.
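
To make the failure mode concrete, a hedged sketch (with a hypothetical helper; this is not the osc/rdma code): instead of counting anonymous posts, record which rank posted and have win_start wait for posts from exactly the members of its start group.

/* Hypothetical sketch: match posts by rank against the start group. */
static bool all_group_members_posted(ompi_group_t *group, const bool *posted) {
    int size = ompi_group_size(group);
    for (int i = 0; i < size; i++) {
        int rank = group_rank_to_comm_rank(group, i); /* hypothetical helper */
        if (!posted[rank]) {
            return false; /* a post from D or E never satisfies group1 */
        }
    }
    return true;
}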

Member:
@xinzhao3 If you don't hate active target when this is done, I will be very surprised :).

Contributor Author:
@hjelmn I see what you mean. You are right, a simple post count is not correct. I am looking at the osc/rdma implementation now.

@xinzhao3 (Contributor, Author), Jul 13, 2017:
@hjelmn I modified win_post and win_start using the algorithm in osc/rdma; could you review again? (ucx_post is at L240, ucx_start is at L93)

Contributor Author:
@hjelmn are there any other issues?

Contributor Author:
@hjelmn @jladd-mlnx I tested with ompi-tests/onesided and ompi-tests/mpi_test_suite/one-sided. I am trying to trigger MTT to test my branch, so that we can run the entire ompi-tests suite.

@xinzhao3 force-pushed the topic/ompi-osc-ucx branch from 2900ad7 to 0bbe0cc, July 13, 2017 21:06
@xinzhao3 (Contributor, Author):
bot:mellanox:retest

@hjelmn (Member) commented Jul 13, 2017
Before I continue: did you run this with ompi-tests?

@xinzhao3 (Contributor, Author):
@hjelmn No, I just ran the OSU tests and some of my own tests. How do I run ompi-tests? Is there a way to have Jenkins run it?

@xinzhao3 force-pushed the topic/ompi-osc-ucx branch from 0bbe0cc to f0871e5, July 14, 2017 01:31
@xinzhao3 force-pushed the topic/ompi-osc-ucx branch from f0871e5 to 3e93679, July 14, 2017 23:42
@hjelmn (Member) commented Jul 17, 2017

@jladd-mlnx I don't know if @xinzhao3 has access to the ompi-tests repository. If not, can you get her a copy?

@xinzhao3 From the ompi-tests repo, run all the tests in the one-sided and ibm subdirectories. The one-sided tests are a simple make; make check, and the ibm ones are ./autogen.pl; ./configure; make; cd one-sided; make check. You will need to make sure all tests are run across multiple nodes and that the ucx component is being used.

@xinzhao3 force-pushed the topic/ompi-osc-ucx branch from 3e93679 to 2aa5292, July 19, 2017 16:46
@xinzhao3 (Contributor, Author):
@hjelmn I have tested the code with ompi-tests. Could you review the code again?

@jladd-mlnx (Member):
@hjelmn sorry, @xinzhao3 has access and has been debugging for the past couple of days. She's knocked out all the open issues and is ready for another round of review when you have some time.

@jladd-mlnx (Member):
Merging this now to get it under our MTT.

@jladd-mlnx merged commit 8f5cb4c into open-mpi:master Jul 25, 2017
@xinzhao3 deleted the topic/ompi-osc-ucx branch January 22, 2018 18:36