v3.0.x: Hostfile behavior under RM allocation #3984

Closed
@artpol84

Open MPI version


Details of the problem

When running an application under a SLURM allocation but with a hostfile, the ras/slurm component breaks the launch:

$ mpirun --bind-to core --map-by node -hostfile ./hfile -np 24 -mca pml ob1 -mca btl tcp,self --mca ras_base_verbose 100 <app>
[headnode:04487] mca: base: components_register: registering framework ras components
[headnode:04487] mca: base: components_register: found loaded component slurm
[headnode:04487] mca: base: components_register: component slurm register function successful
[headnode:04487] mca: base: components_open: opening ras components
[headnode:04487] mca: base: components_open: found loaded component slurm
[headnode:04487] mca: base: components_open: component slurm open function successful
[headnode:04487] mca:base:select: Auto-selecting ras components
[headnode:04487] mca:base:select:(  ras) Querying component [slurm]
[headnode:04487] mca:base:select:(  ras) Query of component [slurm] set priority to 50
[headnode:04487] mca:base:select:(  ras) Selected component [slurm]

======================   ALLOCATED NODES   ======================
        cn01: flags=0x10 slots=12 max_slots=0 slots_inuse=0 state=UP
        cn02: flags=0x10 slots=12 max_slots=0 slots_inuse=0 state=UP
=================================================================
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 24 slots
that were requested by the application:
  /hpc/local/benchmarks/hpcx_install_Sunday/hpcx-gcc-redhat7.2/ompi-v3.0.x/tests/osu-micro-benchmarks-5.3.2/osu_barrier

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
[headnode:04487] mca: base: close: component slurm closed
[headnode:04487] mca: base: close: unloading component slurm

If I explicitly disable the slurm ras component, everything works fine:

$ mpirun --bind-to core --map-by node -hostfile ./hfile -np 24 -mca pml ob1 -mca btl tcp,self --mca ras_base_verbose 100 --mca ras '^slurm' <app>
[headnode:06721] mca: base: components_register: registering framework ras components
[headnode:06721] mca: base: components_register: found loaded component simulator
[headnode:06721] mca: base: components_register: component simulator register function successful
[headnode:06721] mca: base: components_open: opening ras components
[headnode:06721] mca: base: components_open: found loaded component simulator
[headnode:06721] mca:base:select: Auto-selecting ras components
[headnode:06721] mca:base:select:(  ras) Querying component [simulator]
[headnode:06721] mca:base:select:(  ras) No component selected!

======================   ALLOCATED NODES   ======================
        cn01: flags=0x00 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
        cn02: flags=0x00 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================

... <app output> ...
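
To make the comparison concrete, the slot totals in the two ALLOCATED NODES tables above can be tallied with a small awk sketch. The helper name `sum_slots` is hypothetical (it is not part of Open MPI), and the sketch assumes the table format shown in the logs above:

```shell
# Hypothetical helper (not part of Open MPI): sum the slots= values
# from an "ALLOCATED NODES" table read on stdin.
sum_slots() {
    awk '{
        for (i = 1; i <= NF; i++)
            # Match only the "slots=N" field, not max_slots= or slots_inuse=
            if ($i ~ /^slots=[0-9]+$/) {
                split($i, kv, "=")
                total += kv[2]
            }
    } END { print total + 0 }'
}

# Table printed with ras/slurm selected
sum_slots <<'EOF'
        cn01: flags=0x10 slots=12 max_slots=0 slots_inuse=0 state=UP
        cn02: flags=0x10 slots=12 max_slots=0 slots_inuse=0 state=UP
EOF

# Table printed with --mca ras '^slurm'
sum_slots <<'EOF'
        cn01: flags=0x00 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
        cn02: flags=0x00 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
EOF
```

The first table totals 24 slots, which matches -np 24, yet that launch is rejected; the second totals only 2 slots, yet that launch proceeds.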

@rhc54 is this expected behavior?
