Commit 97f8f5a

Merge pull request #132 from tharittk/docs-nccl
Documentation Update for NCCL
2 parents 3585f36 + 0d3680b commit 97f8f5a

5 files changed (+130, -4 lines)


Makefile

Lines changed: 12 additions & 2 deletions
@@ -2,7 +2,7 @@ PIP := $(shell command -v pip3 2> /dev/null || command which pip 2> /dev/null)
 PYTHON := $(shell command -v python3 2> /dev/null || command which python 2> /dev/null)
 NUM_PROCESSES = 3

-.PHONY: install dev-install install_conda dev-install_conda tests doc docupdate run_examples run_tutorials
+.PHONY: install dev-install dev-install_nccl install_conda install_conda_nccl dev-install_conda dev-install_conda_nccl tests tests_nccl doc docupdate run_examples run_tutorials

 pipcheck:
 ifndef PIP
@@ -24,19 +24,29 @@ dev-install:
 	make pipcheck
 	$(PIP) install -r requirements-dev.txt && $(PIP) install -e .

+dev-install_nccl:
+	make pipcheck
+	$(PIP) install -r requirements-dev.txt && $(PIP) install cupy-cuda12x nvidia-nccl-cu12 && $(PIP) install -e .
+
 install_conda:
 	conda env create -f environment.yml && conda activate pylops_mpi && pip install .

+install_conda_nccl:
+	conda env create -f environment.yml && conda activate pylops_mpi && conda install -c conda-forge cupy nccl && pip install .
+
 dev-install_conda:
 	conda env create -f environment-dev.yml && conda activate pylops_mpi && pip install -e .

+dev-install_conda_nccl:
+	conda env create -f environment-dev.yml && conda activate pylops_mpi && conda install -c conda-forge cupy nccl && pip install -e .
+
 lint:
 	flake8 pylops_mpi/ tests/ examples/ tutorials/

 tests:
 	mpiexec -n $(NUM_PROCESSES) pytest tests/ --with-mpi

-# assuming NUM_PRCESS <= number of gpus available
+# assuming NUM_PROCESSES <= number of gpus available
 tests_nccl:
 	mpiexec -n $(NUM_PROCESSES) pytest tests_nccl/ --with-mpi

README.md

Lines changed: 4 additions & 0 deletions
@@ -34,6 +34,10 @@ and running the following command:
 ```
 make install_conda
 ```
+Optionally, if you work in a multi-GPU environment and want to enable Nvidia's Collective Communication Library (NCCL), install your environment with
+```
+make install_conda_nccl
+```

 ## Run Pylops-MPI
 Once you have installed the prerequisites and pylops-mpi, you can run pylops-mpi using the `mpiexec` command.

docs/source/gpu.rst

Lines changed: 73 additions & 2 deletions
@@ -22,6 +22,15 @@ can handle both scenarios. Note that, since most operators in PyLops-mpi are thi
 some of the operators in PyLops that lack a GPU implementation cannot be used also in PyLops-mpi when working with
 cupy arrays.

+Moreover, PyLops-MPI supports Nvidia's Collective Communication Library (NCCL) for highly optimized
+collective operations, such as AllReduce and AllGather. This allows PyLops-MPI users to leverage
+proprietary interconnects such as NVLink that may be available in their infrastructure for fast data communication.
+
+.. note::
+
+   Set the environment variable ``NCCL_PYLOPS_MPI=0`` to explicitly force PyLops-MPI to ignore the ``NCCL`` backend.
+   However, this is optional, as users can also opt out of NCCL simply by not passing a ``cupy.cuda.nccl.NcclCommunicator`` to
+   :class:`pylops_mpi.DistributedArray`.

 Example
 -------
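The note above gives two ways of staying on the plain MPI path: setting `NCCL_PYLOPS_MPI=0`, or not passing an NCCL communicator at all. That decision rule can be sketched in Python as follows. The helper `use_nccl_backend` is hypothetical and is not PyLops-MPI's actual code; treating an unset variable (or any value other than `"0"`) as "not disabled" is also an assumption.

```python
import os
from typing import Mapping, Optional


def use_nccl_backend(nccl_comm: Optional[object],
                     environ: Mapping[str, str] = os.environ) -> bool:
    """Hypothetical rule: use NCCL only if it is not disabled through
    NCCL_PYLOPS_MPI=0 *and* an NCCL communicator was actually supplied."""
    # Assumption: a missing variable means the NCCL backend is not disabled
    disabled = environ.get("NCCL_PYLOPS_MPI", "1") == "0"
    return (not disabled) and nccl_comm is not None


# Explicit dictionaries stand in for the real process environment
fake_comm = object()  # placeholder for a cupy.cuda.nccl.NcclCommunicator
print(use_nccl_backend(fake_comm, {}))                        # True
print(use_nccl_backend(fake_comm, {"NCCL_PYLOPS_MPI": "0"}))  # False
print(use_nccl_backend(None, {}))                             # False
```

Passing the environment as a parameter keeps the rule testable without touching `os.environ`.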
@@ -79,7 +88,69 @@ your GPU:
 The code is almost unchanged apart from the fact that we now use ``cupy`` arrays,
 PyLops-mpi will figure this out!

+Finally, if NCCL is available, a ``cupy.cuda.nccl.NcclCommunicator`` can be initialized and passed to :class:`pylops_mpi.DistributedArray` (through ``base_comm_nccl``)
+as follows:
+
+.. code-block:: python
+
+   from pylops_mpi.utils._nccl import initialize_nccl_comm
+
+   # Initialize NCCL communicator
+   nccl_comm = initialize_nccl_comm()
+
+   # Create distributed data (broadcast)
+   nxl, nt = 20, 20
+   dtype = np.float32
+   d_dist = pylops_mpi.DistributedArray(global_shape=nxl * nt,
+                                        base_comm_nccl=nccl_comm,
+                                        partition=pylops_mpi.Partition.BROADCAST,
+                                        engine="cupy", dtype=dtype)
+   d_dist[:] = cp.ones(d_dist.local_shape, dtype=dtype)
+
+   # Create and apply VStack operator
+   Sop = pylops.MatrixMult(cp.ones((nxl, nxl)), otherdims=(nt, ))
+   VOp = pylops_mpi.MPIVStack(ops=[Sop, ])
+   y_dist = VOp @ d_dist
+
+Under the hood, PyLops-MPI uses both an MPI communicator and an NCCL communicator to manage distributed operations. Each GPU is logically bound to
+one MPI process. Minor communications, such as those exchanging array shapes and sizes, are still performed with MPI, while collective calls on array data, such as AllReduce, are carried out through NCCL.
+
 .. note::

-   The CuPy backend is in active development, with many examples not yet in the docs.
-   You can find many `other examples <https://github.com/PyLops/pylops_notebooks/tree/master/developement-mpi/Cupy_MPI>`_ from the `PyLops Notebooks repository <https://github.com/PyLops/pylops_notebooks>`_.
+   The CuPy and NCCL backends are in active development, with many examples not yet in the docs.
+   You can find many `other examples <https://github.com/PyLops/pylops_notebooks/tree/master/developement-mpi/Cupy_MPI>`_ from the `PyLops Notebooks repository <https://github.com/PyLops/pylops_notebooks>`_.
+
+Support for NCCL Backend
+------------------------
+In the following, we provide a list of modules (i.e., operators and solvers) for which we plan to support NCCL, together with their current status:
+
+.. list-table::
+   :widths: 50 25
+   :header-rows: 1
+
+   * - Module
+     - NCCL support
+   * - :class:`pylops_mpi.DistributedArray`
+     - /
+   * - :class:`pylops_mpi.basicoperators.MPIVStack`
+     - Ongoing
+   * - :class:`pylops_mpi.basicoperators.MPIHStack`
+     - Ongoing
+   * - :class:`pylops_mpi.basicoperators.MPIBlockDiag`
+     - Ongoing
+   * - :class:`pylops_mpi.basicoperators.MPIGradient`
+     - Ongoing
+   * - :class:`pylops_mpi.basicoperators.MPIFirstDerivative`
+     - Ongoing
+   * - :class:`pylops_mpi.basicoperators.MPISecondDerivative`
+     - Ongoing
+   * - :class:`pylops_mpi.basicoperators.MPILaplacian`
+     - Ongoing
+   * - :func:`pylops_mpi.optimization.basic.cg`
+     - Ongoing
+   * - :func:`pylops_mpi.optimization.basic.cgls`
+     - Ongoing
+   * - ISTA Solver
+     - Planned
+   * - Complex Numeric Data Type for NCCL
+     - Planned
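The last row of the table lists complex data types as planned. NCCL's reduction datatypes are integer and floating-point types only, so one possible workaround (an assumption here, not necessarily the approach PyLops-MPI will take) is to reinterpret a complex array as an interleaved real array of twice the length: an elementwise sum of those real views equals the complex sum. This holds for SUM but not for operations such as PROD or MAX. The sketch below uses NumPy and simulates the all-reduce on the CPU; `as_real_view` and `simulated_allreduce_sum` are hypothetical names.

```python
import numpy as np


def as_real_view(x: np.ndarray) -> np.ndarray:
    """Reinterpret a complex array as interleaved (real, imag) floats."""
    if x.dtype == np.complex64:
        real_dtype = np.float32
    elif x.dtype == np.complex128:
        real_dtype = np.float64
    else:
        raise TypeError(f"expected complex64 or complex128, got {x.dtype}")
    # .view() needs a contiguous buffer; no copy is made if x already is one
    return np.ascontiguousarray(x).view(real_dtype)


def simulated_allreduce_sum(local_arrays):
    """CPU stand-in for an NCCL SUM all-reduce over complex data.

    Only real-valued views are summed, as a backend without complex
    datatypes would do; every rank would receive the returned array.
    """
    complex_dtype = local_arrays[0].dtype
    total = np.zeros_like(as_real_view(local_arrays[0]))
    for arr in local_arrays:
        total += as_real_view(arr)
    # Reinterpret the reduced real buffer as complex numbers again
    return total.view(complex_dtype)


# Two "ranks", each holding a local complex64 vector
ranks = [np.array([1 + 2j, 3 - 1j], dtype=np.complex64),
         np.array([0.5 - 2j, 1 + 1j], dtype=np.complex64)]
print(simulated_allreduce_sum(ranks))
```

On GPUs the same idea would apply to the raw device buffers passed to NCCL, with the element count doubled.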

docs/source/index.rst

Lines changed: 4 additions & 0 deletions
@@ -14,6 +14,10 @@ By integrating MPI (Message Passing Interface), PyLops-MPI optimizes the collabo
 computing nodes, enabling large and intricate tasks to be divided, solved, and aggregated in an efficient and
 parallelized manner.

+PyLops-MPI also supports Nvidia's Collective Communication Library `(NCCL) <https://developer.nvidia.com/nccl>`_ for high-performance
+GPU-to-GPU communications. PyLops-MPI's NCCL engine works alongside MPI, delegating GPU-to-GPU communication tasks to
+the highly optimized NCCL library while leveraging MPI for CPU-side coordination and orchestration.
+
 Get started by :ref:`installing PyLops-MPI <Installation>` and following our quick tour.

 Terminology
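A common way to let MPI orchestrate NCCL is for rank 0 to create an NCCL unique ID, broadcast it with MPI, and then build the NCCL communicator on every rank, each bound to its own GPU. The sketch below illustrates that pattern with mpi4py and CuPy; it is an assumption about what a helper such as `initialize_nccl_comm` might do, not its actual source. The names `select_device`, `init_nccl_comm_sketch` and `allreduce_sum_sketch`, and the `rank % n_devices` device choice, are hypothetical. GPU-dependent imports are deferred so the module can be imported on machines without CUDA.

```python
def select_device(rank: int, n_devices: int) -> int:
    """Hypothetical rank-to-GPU mapping (one GPU per rank if ranks <= GPUs)."""
    if n_devices < 1:
        raise RuntimeError("no CUDA devices visible")
    return rank % n_devices


def init_nccl_comm_sketch():
    """Create an NCCL communicator whose setup is coordinated through MPI."""
    # Deferred imports: mpi4py, CuPy and NCCL are only needed at call time
    from mpi4py import MPI
    import cupy as cp
    from cupy.cuda import nccl

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Bind this MPI process to one GPU before creating the communicator
    cp.cuda.Device(select_device(rank, cp.cuda.runtime.getDeviceCount())).use()

    # CPU-side coordination: MPI broadcasts the NCCL unique ID from rank 0
    uid = nccl.get_unique_id() if rank == 0 else None
    uid = comm.bcast(uid, root=0)

    # From here on, GPU data can move through this NCCL communicator
    return nccl.NcclCommunicator(size, uid, rank)


def allreduce_sum_sketch(nccl_comm, x):
    """Sum a contiguous float32 CuPy array over all ranks with NCCL."""
    import cupy as cp
    from cupy.cuda import nccl

    if x.dtype != cp.float32 or not x.flags.c_contiguous:
        raise TypeError("this sketch only handles contiguous float32 arrays")
    recv = cp.empty_like(x)
    # GPU-to-GPU reduction on the default stream; MPI is not involved here
    nccl_comm.allReduce(x.data.ptr, recv.data.ptr, x.size,
                        nccl.NCCL_FLOAT32, nccl.NCCL_SUM,
                        cp.cuda.Stream.null.ptr)
    return recv
```

Launched with two processes (e.g. `mpiexec -n 2 python script.py`, script name hypothetical), each rank would call `init_nccl_comm_sketch()` and then `allreduce_sum_sketch(comm, cp.ones(4, dtype=cp.float32))`, and every rank would receive an array filled with 2.0.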

docs/source/installation.rst

Lines changed: 37 additions & 0 deletions
@@ -45,6 +45,14 @@ Fork the `PyLops-MPI repository <https://github.com/PyLops/pylops-mpi>`_ and clo
 We recommend installing dependencies into a separate environment.
 For that end, we provide a `Makefile` with useful commands for setting up the environment.

+Enable Nvidia Collective Communication Library
+=======================================================
+To obtain highly optimized performance on GPU clusters, PyLops-MPI also supports Nvidia's Collective Communication Library
+`(NCCL) <https://developer.nvidia.com/nccl>`_. This requires two additional dependencies, CuPy and NCCL:
+
+* `CuPy with NCCL <https://docs.cupy.dev/en/stable/install.html>`_
+
+
 Step-by-step installation for users
 ***********************************

@@ -89,6 +97,12 @@ For a ``conda`` environment, run

 This will create and activate an environment called ``pylops_mpi``, with all required and optional dependencies.

+If you want to enable `NCCL <https://developer.nvidia.com/nccl>`_ in PyLops-MPI, run the following instead:
+
+.. code-block:: bash
+
+   >> make dev-install_conda_nccl
+
 Pip
 ---
 If you prefer a ``pip`` installation, we provide the following command
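After installing with NCCL support (through either the conda or the pip route), a quick way to confirm that CuPy can load its NCCL bindings is a small Python probe like the one below. `nccl_status` is a hypothetical helper; it relies on the `cupy.cuda.nccl` module (which provides `NcclCommunicator`) and its `get_version()` function, and it reports failure instead of raising when CuPy or NCCL is missing.

```python
from typing import Tuple


def nccl_status() -> Tuple[bool, str]:
    """Report whether CuPy's NCCL bindings can be imported and queried."""
    try:
        # Fails if CuPy is missing or was installed without NCCL support
        from cupy.cuda import nccl
    except Exception as exc:  # ImportError, or CUDA errors raised on import
        return False, f"CuPy NCCL bindings unavailable: {exc!r}"
    try:
        # Runtime NCCL version, returned as an integer code
        return True, f"NCCL runtime version code: {nccl.get_version()}"
    except Exception as exc:  # e.g. the NCCL shared library cannot be loaded
        return False, f"NCCL library could not be queried: {exc!r}"


if __name__ == "__main__":
    ok, detail = nccl_status()
    print("NCCL available:", ok, "|", detail)
```

The broad `except Exception` clauses are deliberate for a diagnostic probe, since import failures on GPU-less machines are not always `ImportError`.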
@@ -100,6 +114,23 @@ If you prefer a ``pip`` installation, we provide the following command
 Note that, differently from the ``conda`` command, the above **will not** create a virtual environment.
 Make sure you create and activate your environment previously.

+Similarly, if you want to enable `NCCL <https://developer.nvidia.com/nccl>`_ but prefer using pip,
+you must first check the CUDA version of your system:
+
+.. code-block:: bash
+
+   >> nvidia-smi
+
+The `Makefile` is pre-configured for CUDA 12.x. If you use this version, run
+
+.. code-block:: bash
+
+   >> make dev-install_nccl
+
+Otherwise, change the command in the `Makefile` to match your CUDA version.
+For example, if you use CUDA 11.x, change ``cupy-cuda12x`` and ``nvidia-nccl-cu12`` to ``cupy-cuda11x`` and ``nvidia-nccl-cu11``
+before running the command.
+
 Run tests
 =========
 To ensure that everything has been setup correctly, run tests:
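The CUDA-version check and the package renaming described above can be sketched in Python. `parse_cuda_major` and `pip_packages_for_cuda` are hypothetical helpers, and the sample `nvidia-smi` header line is illustrative, not output captured from a real machine. Note that the `CUDA Version` field shown by `nvidia-smi` is the highest CUDA version the driver supports, which can differ from the CUDA toolkit actually installed.

```python
import re
from typing import Optional, Tuple


def parse_cuda_major(nvidia_smi_output: str) -> Optional[int]:
    """Return the major version from the 'CUDA Version: X.Y' field, or None."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    return int(match.group(1)) if match else None


def pip_packages_for_cuda(major: int) -> Tuple[str, str]:
    """Map a CUDA major version to the CuPy and NCCL wheels named in the text."""
    if major not in (11, 12):
        raise ValueError(f"CUDA {major}.x is not covered by this sketch")
    return f"cupy-cuda{major}x", f"nvidia-nccl-cu{major}"


# Illustrative header line, not output captured from a real machine
sample = "| NVIDIA-SMI 535.104.05   Driver Version: 535.104.05   CUDA Version: 12.2 |"
major = parse_cuda_major(sample)
if major is not None:
    print("pip install", *pip_packages_for_cuda(major))
```

On a real system, the input text could come from `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout`.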
@@ -110,6 +141,12 @@ To ensure that everything has been setup correctly, run tests:

 Make sure no tests fail, this guarantees that the installation has been successful.

+If PyLops-MPI is installed with NCCL, also run the NCCL tests:
+
+.. code-block:: bash
+
+   >> make tests_nccl
+
 Run examples and tutorials
 ==========================
 Since the sphinx-gallery creates examples/tutorials using only a single process, it is highly recommended to test the
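The `tests_nccl` Makefile target assumes that `NUM_PROCESSES` does not exceed the number of available GPUs, since each MPI rank is meant to drive its own GPU (recent NCCL versions refuse to build a communicator in which two ranks share a device). A pre-flight check can be sketched as follows. `nccl_test_command` is a hypothetical helper that builds the same `mpiexec` command as the Makefile recipe and refuses to do so when there are too few GPUs; the GPU count could come, for instance, from `cupy.cuda.runtime.getDeviceCount()`.

```python
from typing import List


def nccl_test_command(num_processes: int, num_gpus: int) -> List[str]:
    """Hypothetical pre-flight check mirroring the Makefile's tests_nccl target.

    Raises ValueError when there are more MPI processes than GPUs,
    because each rank is expected to own one GPU.
    """
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    if num_processes > num_gpus:
        raise ValueError(
            f"NUM_PROCESSES={num_processes} exceeds the {num_gpus} GPU(s) available"
        )
    # Same command as the Makefile recipe, with $(NUM_PROCESSES) substituted
    return ["mpiexec", "-n", str(num_processes),
            "pytest", "tests_nccl/", "--with-mpi"]


if __name__ == "__main__":
    print(" ".join(nccl_test_command(num_processes=2, num_gpus=4)))
```

Returning the command as a list (rather than running it) keeps the check separate from execution, so it could be passed to `subprocess.run` only after validation succeeds.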
