I'm seeing a segfault when doing Device-to-Device transfers (OSU) between NVidia GPUs using the libfabric `mtl` and the LinkX provider. Here is what I do:

- compile the newest libfabric master (or the 2.0.0 release) with CUDA support, and with either the CXI provider (for a Slingshot system) or the verbs provider (for an IB-based system)
- compile OpenMPI 5.0.7 with PR #12290 (mtl: ofi change to allow cxi anywhere in provname)

As discussed previously in #13048, this PR is needed to make OpenMPI + LinkX work correctly on HIP GPUs (without it OpenMPI fails with a `Required key not available` error). However, on a system with NVidia GPUs + CXI the same test fails with a segfault, as reported to OFI in ofiwg/libfabric#10865.
Now I have reproduced the same segfault on an InfiniBand-based system, with verbs instead of CXI. To reproduce this I have to extend the patch in #12290 to also set `ompi_mtl_ofi.hmem_needs_reg = false` for the verbs provider, roughly as sketched below.
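For illustration only, the kind of change I mean in `mtl_ofi_component.c` looks roughly like this; the `strstr`-based provider-name check is my paraphrase of how #12290 matches providers, not the exact diff:

```c
/* Sketch of the experiment, not the exact diff: after the OFI provider has
 * been selected, skip Open MPI's own HMEM registration when the provider
 * name contains "cxi" (what PR #12290 effectively does) or "verbs" (my
 * extension). "prov" stands for the selected struct fi_info. */
if (NULL != strstr(prov->fabric_attr->prov_name, "cxi") ||
    NULL != strstr(prov->fabric_attr->prov_name, "verbs")) {
    ompi_mtl_ofi.hmem_needs_reg = false;
}
```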
Otherwise I get the familiar memory registration error:
```
--------------------------------------------------------------------------
Open MPI failed to register your buffer.
This error is fatal, your job will abort
Buffer Type: cuda
Buffer Address: 0x148b43600000
Buffer Length: 1
Error: Required key not available (4294967030)
--------------------------------------------------------------------------
```
When I extend the patch to also set `ompi_mtl_ofi.hmem_needs_reg = false` for verbs and run the OSU benchmark as follows:

```
export FI_LNX_PROV_LINKS=shm+verbs
mpirun -np 2 -mca pml cm -mca mtl ofi -mca opal_common_ofi_provider_include "shm" -map-by numa ./osu_bibw D D
```

I get the exact same segfault as with CUDA+CXI (ofiwg/libfabric#10865).
I understand my change to `mtl_ofi_component.c` was an experiment and is probably wrong, and libfabric is not the main execution path on this system. But the point is that the same failure now shows up on two systems, CUDA+CXI and CUDA+IB, and CUDA+CXI should work. So is it possible that something is not entirely correct with PR #12290 on NVidia GPUs? Could it be that the buffers should in fact be registered in this case, because the LinkX/CXI providers handle them differently? And if so, should this be fixed in OpenMPI or in libfabric?
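To be explicit about what I mean by "registered": when `hmem_needs_reg` is true, the MTL performs an explicit memory registration for the device buffer, which in libfabric terms boils down to something like the following `fi_mr_regattr` call (schematic only; `domain`, `buf`, `len` and `cuda_dev` are placeholders, not OpenMPI's actual variables):

```c
/* Schematic of registering a CUDA buffer with libfabric; "domain", "buf",
 * "len" and "cuda_dev" are placeholders, not Open MPI code. */
struct iovec iov = { .iov_base = buf, .iov_len = len };
struct fi_mr_attr attr = {
    .mr_iov      = &iov,
    .iov_count   = 1,
    .access      = FI_SEND | FI_RECV,
    .iface       = FI_HMEM_CUDA,   /* tell the provider this is device memory */
    .device.cuda = cuda_dev,       /* CUDA device ordinal owning the buffer   */
};
struct fid_mr *mr = NULL;

int ret = fi_mr_regattr(domain, &attr, 0, &mr);
/* With hmem_needs_reg = false this call is skipped entirely, and the
 * provider (CXI/verbs via LinkX) is expected to handle the buffer itself. */
```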
As an update (also posted in ofiwg/libfabric#10865): the segfault appears to be caused by a NULL handle being passed to libfabric's `cuda_gdrcopy_dev_unregister`. The benchmark runs through if I modify the libfabric code to tolerate the NULL pointer, as explained in ofiwg/libfabric#10865 and sketched below. I have not figured out whether the root cause lies in libfabric or in some OpenMPI logic, but it seems to be related to the use of gdrcopy.
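The workaround is essentially an early return on a NULL/zero handle. This is a paraphrase of my local change, not the exact libfabric patch; the struct and function names, and the exact signature in libfabric's gdrcopy code, may differ slightly:

```c
/* Paraphrased workaround: bail out of the gdrcopy unregister path when the
 * handle is NULL/0 instead of dereferencing it (which is what segfaults). */
int cuda_gdrcopy_dev_unregister(uint64_t handle)
{
	struct cuda_gdrcopy_handle *gdrcopy =
		(struct cuda_gdrcopy_handle *) (uintptr_t) handle;

	if (!gdrcopy)
		return FI_SUCCESS;   /* nothing was registered, treat as a no-op */

	/* ... existing gdr_unmap()/gdr_unpin_buffer() cleanup unchanged ... */
	return FI_SUCCESS;
}
```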
I also tried to run without gdrcopy, but that results in an OpenMPI error:
```
FI_HMEM_CUDA_USE_GDRCOPY=0 mpirun -np 2 -mca pml cm -mca mtl ofi -mca opal_common_ofi_provider_include "shm+cxi:lnx" -prtemca ras_base_launch_orted_on_hn 1 -map-by numa ~/gpubind_pmix.sh ./osu_bibw D D

# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.5
# Datatype: MPI_CHAR.
# Size      Bandwidth (MB/s)
[blancapeak001:00000] *** An error occurred in MPI_Irecv
[blancapeak001:00000] *** reported by process [3024093185,281470681743360]
[blancapeak001:00000] *** on communicator MPI_COMM_WORLD
[blancapeak001:00000] *** MPI_ERR_OTHER: known error not in list
[blancapeak001:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[blancapeak001:00000] ***    and MPI will try to terminate your MPI job as well)
```