The mpi-operator MPIJob gained IntelMPI support back in summer 2021. Although training-operator added MPIJob
shortly after that, it is still missing IntelMPI support.
Related mpi-operator PRs can be seen from this list: https://github.com/kubeflow/mpi-operator/pulls?q=is%3Apr+intel+mpi+is%3Aclosed
PR #1804 adds IntelMPI env var support, but other changes are needed as well.
IMHO the most important ones from mpi-operator are:
- IntelMPI vs. OpenMPI worker slots format support: "Add slots to hostfile" (mpi-operator#523)
- Option for selecting which MPI implementation is used: "add support for using Intel MPI(2019.7) and MVAPICH2" (mpi-operator#283)
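
To illustrate the slots format difference from the first item: Open MPI hostfiles use `host slots=N`, while Intel MPI's Hydra launcher expects `host:N`. Below is a minimal Python sketch of that distinction, not the operator's actual code. The function name `render_hostfile`, the worker host names, and the implementation strings are hypothetical and chosen only for illustration.

```python
def render_hostfile(workers, slots, implementation):
    """Render hostfile text for the given MPI implementation.

    workers: list of worker host names (hypothetical values in the demo below)
    slots: number of slots (processes) per worker
    implementation: "OpenMPI" or "Intel" (illustrative labels, an assumption)
    """
    lines = []
    for host in workers:
        if implementation == "OpenMPI":
            # Open MPI hostfile syntax: "<host> slots=<n>"
            lines.append(f"{host} slots={slots}")
        elif implementation == "Intel":
            # Intel MPI (Hydra) hostfile syntax: "<host>:<n>"
            lines.append(f"{host}:{slots}")
        else:
            raise ValueError(f"unsupported MPI implementation: {implementation}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    hosts = ["job-worker-0", "job-worker-1"]  # hypothetical worker names
    print(render_hostfile(hosts, 2, "OpenMPI"), end="")
    print(render_hostfile(hosts, 2, "Intel"), end="")
```

Because the two formats are incompatible, the operator needs to know which implementation the job uses before it generates the hostfile, which is why the second item matters as well.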
And these few other PRs could also be relevant:
- Connection repeat for robustness: "Add support for Intel MPI" (mpi-operator#389)
- Readiness probe when SSH is used: "Add readiness probe to Intel MPI jobs" (mpi-operator#425)
- E2E tests robustness: "Fix intel MPI E2E test image" (mpi-operator#417)
- Examples: "Add base images and make PI samples inherit from it" (mpi-operator#419)