Support Nebius InfiniBand clusters #2604


Merged: 3 commits into `master` from `issue_2590_nebius_clusters` on May 7, 2025
Conversation

@jvstme (Collaborator) commented May 6, 2025

An InfiniBand cluster is now created automatically when provisioning 8xH100 or 8xH200 instances using a fleet configuration with `placement: cluster`.

Support for Nebius clusters is implemented using `dstack` placement groups. Several changes were made to the placement group management logic:

- The offer for the master instance of the fleet is passed to `Compute.create_placement_group`, which allows setting different placement group settings based on the offer. Nebius requires different settings for H100 and H200 clusters.
- `Compute.is_suitable_placement_group` is introduced to allow choosing an appropriate placement group when creating the master instance and filtering offers for non-master instances based on backend-specific placement group properties. Nebius currently only provides homogeneous clusters, so offers need to be filtered based on the placement group.
- The placement group object is passed to `Compute.create_instance` so the instance can be added to the placement group using its backend-specific properties, such as the cluster ID on Nebius.
- The placement group name is generated at master instance provisioning time, not at fleet creation time. This makes it possible to have different placement group names within the same fleet and avoids name conflicts, since multiple placement groups can be created while `dstack` is trying different offers for the master instance.
- Placement groups that were created during master instance provisioning but didn't end up being used are now cleaned up. Nebius quotas limit the number of clusters, so unused clusters need to be cleaned up promptly, without waiting for fleet deletion.
- If all offers fail for the master instance, `dstack` no longer attempts to provision the other fleet instances, to avoid them being provisioned without a placement group or without connectivity at all.
- Placement group creation errors are now handled gracefully, so that `dstack` can move on to other master instance offers, which may lead to creating different placement groups. For example, if `dstack` cannot create a cluster in one Nebius region because of a missing quota, it may attempt to create a cluster in another region.
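The interface changes above can be sketched roughly as follows. Only the `Compute` method names come from this PR; the data classes, fields, and the `fabric` values are illustrative stand-ins, not dstack's actual code:

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative stand-ins for dstack's internal models.
@dataclass
class Offer:
    region: str
    gpu: str  # e.g. "H100" or "H200"

@dataclass
class PlacementGroup:
    name: str
    region: str
    # Backend-specific data, e.g. a Nebius cluster ID (hypothetical shape).
    backend_data: dict = field(default_factory=dict)

class NebiusCompute:
    def create_placement_group(self, name: str, master_offer: Offer) -> PlacementGroup:
        # The master instance's offer is now available here, so settings
        # can differ between H100 and H200 clusters (values are made up).
        fabric = "fabric-h100" if master_offer.gpu == "H100" else "fabric-h200"
        return PlacementGroup(
            name=name,
            region=master_offer.region,
            backend_data={"cluster_id": f"cluster-{name}", "fabric": fabric},
        )

    def is_suitable_placement_group(self, pg: PlacementGroup, offer: Offer) -> bool:
        # Nebius clusters are homogeneous, so offers for non-master
        # instances are filtered against the existing group's properties.
        return offer.region == pg.region

    def create_instance(self, offer: Offer, placement_group: Optional[PlacementGroup]) -> str:
        # The full placement group object is passed in, so backend-specific
        # properties such as the Nebius cluster ID can be used here.
        cluster_id = placement_group.backend_data["cluster_id"] if placement_group else None
        return f"instance in {offer.region}, cluster={cluster_id}"
```

Passing the whole `PlacementGroup` object (rather than just its name) is what lets each backend attach instances using its own notion of a group, such as a cluster ID on Nebius.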

#2590
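Taken together, the master-instance flow described above could look roughly like this sketch (function names, the error type, and the naming scheme are hypothetical): generate a fresh placement group name per offer attempt, tolerate group creation failures, and promptly delete groups that end up unused.

```python
import uuid

class PlacementGroupError(Exception):
    """Hypothetical error raised when a backend cannot create a placement group."""

def provision_master(compute, offers, fleet_name):
    """Try offers in order; return (instance, placement_group) or (None, None)."""
    created = []   # all placement groups created during the attempts
    chosen = None
    instance = None
    for offer in offers:
        # A unique name per attempt avoids conflicts when several groups
        # are created while trying different offers for the same fleet.
        name = f"{fleet_name}-{uuid.uuid4().hex[:8]}"
        try:
            pg = compute.create_placement_group(name, offer)
        except PlacementGroupError:
            continue  # e.g. missing quota in this region; try the next offer
        created.append(pg)
        try:
            instance = compute.create_instance(offer, pg)
            chosen = pg
            break
        except Exception:
            continue
    # Clean up groups that didn't end up being used: Nebius quotas limit
    # the number of clusters, so this must not wait for fleet deletion.
    for pg in created:
        if pg is not chosen:
            compute.delete_placement_group(pg)
    # If every offer failed, the caller should skip the other fleet
    # instances, since they would have no placement group at all.
    return instance, chosen
```

Breaking out of the loop only after both the group and the instance are created is what makes a failed group creation recoverable: the loop simply moves on to the next offer, which may target another region.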

@jvstme (Collaborator, Author) commented May 6, 2025

Will mark as ready for review after more testing and adding unit tests

@jvstme marked this pull request as ready for review May 7, 2025 07:56
@jvstme requested a review from r4victor May 7, 2025 07:57
@jvstme (Collaborator, Author) commented May 7, 2025

NCCL tests
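The output below matches what nccl-tests' `all_reduce_perf` prints for these parameters; an invocation along these lines could produce it (hostnames, rank layout, and the MPI setup are illustrative, not taken from the PR):

```shell
# Illustrative: run all_reduce_perf from nccl-tests across the two 8xH100
# nodes, one MPI rank per GPU, matching the header below
# (-b minBytes, -e maxBytes, -f step factor, -g GPUs per rank, -w warmup iters, -n iters).
mpirun -np 16 -H node1:8,node2:8 \
    ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 -w 5 -n 20
```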
# nThread 1 nGpus 1 minBytes 8 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid     92 on computeinstance-e00fgx6ezb5j59wpmn device  0 [0000:8d:00] NVIDIA H100 80GB HBM3
#  Rank  1 Group  0 Pid     93 on computeinstance-e00fgx6ezb5j59wpmn device  1 [0000:91:00] NVIDIA H100 80GB HBM3
#  Rank  2 Group  0 Pid     94 on computeinstance-e00fgx6ezb5j59wpmn device  2 [0000:95:00] NVIDIA H100 80GB HBM3
#  Rank  3 Group  0 Pid     95 on computeinstance-e00fgx6ezb5j59wpmn device  3 [0000:99:00] NVIDIA H100 80GB HBM3
#  Rank  4 Group  0 Pid     96 on computeinstance-e00fgx6ezb5j59wpmn device  4 [0000:ab:00] NVIDIA H100 80GB HBM3
#  Rank  5 Group  0 Pid     98 on computeinstance-e00fgx6ezb5j59wpmn device  5 [0000:af:00] NVIDIA H100 80GB HBM3
#  Rank  6 Group  0 Pid    100 on computeinstance-e00fgx6ezb5j59wpmn device  6 [0000:b3:00] NVIDIA H100 80GB HBM3
#  Rank  7 Group  0 Pid    103 on computeinstance-e00fgx6ezb5j59wpmn device  7 [0000:b7:00] NVIDIA H100 80GB HBM3
#  Rank  8 Group  0 Pid    105 on computeinstance-e00r463y4p6my1q40j device  0 [0000:8d:00] NVIDIA H100 80GB HBM3
#  Rank  9 Group  0 Pid    106 on computeinstance-e00r463y4p6my1q40j device  1 [0000:91:00] NVIDIA H100 80GB HBM3
#  Rank 10 Group  0 Pid    107 on computeinstance-e00r463y4p6my1q40j device  2 [0000:95:00] NVIDIA H100 80GB HBM3
#  Rank 11 Group  0 Pid    108 on computeinstance-e00r463y4p6my1q40j device  3 [0000:99:00] NVIDIA H100 80GB HBM3
#  Rank 12 Group  0 Pid    109 on computeinstance-e00r463y4p6my1q40j device  4 [0000:ab:00] NVIDIA H100 80GB HBM3
#  Rank 13 Group  0 Pid    111 on computeinstance-e00r463y4p6my1q40j device  5 [0000:af:00] NVIDIA H100 80GB HBM3
#  Rank 14 Group  0 Pid    113 on computeinstance-e00r463y4p6my1q40j device  6 [0000:b3:00] NVIDIA H100 80GB HBM3
#  Rank 15 Group  0 Pid    116 on computeinstance-e00r463y4p6my1q40j device  7 [0000:b7:00] NVIDIA H100 80GB HBM3

#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1    75.33    0.00    0.00      0    34.21    0.00    0.00      0
          16             4     float     sum      -1    29.39    0.00    0.00      0    29.42    0.00    0.00      0
          32             8     float     sum      -1    29.08    0.00    0.00      0    29.73    0.00    0.00      0
          64            16     float     sum      -1    29.22    0.00    0.00      0    29.13    0.00    0.00      0
         128            32     float     sum      -1    30.11    0.00    0.01      0    29.67    0.00    0.01      0
         256            64     float     sum      -1    63.19    0.00    0.01      0    30.50    0.01    0.02      0
         512           128     float     sum      -1    33.33    0.02    0.03      0    31.14    0.02    0.03      0
        1024           256     float     sum      -1    34.77    0.03    0.06      0    31.98    0.03    0.06      0
        2048           512     float     sum      -1    33.63    0.06    0.11      0    33.40    0.06    0.11      0
        4096          1024     float     sum      -1    35.86    0.11    0.21      0    35.14    0.12    0.22      0
        8192          2048     float     sum      -1    36.81    0.22    0.42      0    35.66    0.23    0.43      0
       16384          4096     float     sum      -1    37.63    0.44    0.82      0    36.00    0.46    0.85      0
       32768          8192     float     sum      -1    50.15    0.65    1.23      0    35.86    0.91    1.71      0
       65536         16384     float     sum      -1    38.27    1.71    3.21      0    36.63    1.79    3.35      0
      131072         32768     float     sum      -1    43.76    2.99    5.62      0    41.31    3.17    5.95      0
      262144         65536     float     sum      -1    52.07    5.03    9.44      0    57.45    4.56    8.56      0
      524288        131072     float     sum      -1    124.2    4.22    7.92      0    82.78    6.33   11.88      0
     1048576        262144     float     sum      -1    82.74   12.67   23.76      0    82.82   12.66   23.74      0
     2097152        524288     float     sum      -1    96.63   21.70   40.69      0    98.05   21.39   40.10      0
     4194304       1048576     float     sum      -1    116.3   36.06   67.62      0    117.9   35.59   66.73      0
     8388608       2097152     float     sum      -1    159.5   52.60   98.62      0    174.8   48.00   90.00      0
    16777216       4194304     float     sum      -1    210.5   79.71  149.46      0    205.1   81.79  153.35      0
    33554432       8388608     float     sum      -1    299.8  111.92  209.85      0    281.9  119.01  223.15      0
    67108864      16777216     float     sum      -1    472.0  142.19  266.61      0    475.8  141.04  264.44      0
   134217728      33554432     float     sum      -1    764.5  175.56  329.18      0    749.6  179.06  335.74      0
   268435456      67108864     float     sum      -1   1293.1  207.60  389.24      0   1287.2  208.55  391.03      0
   536870912     134217728     float     sum      -1   2537.9  211.54  396.65      0   2537.6  211.57  396.68      0
  1073741824     268435456     float     sum      -1   4636.2  231.60  434.25      0   4640.9  231.37  433.81      0
  2147483648     536870912     float     sum      -1   9250.7  232.14  435.27      0   9280.0  231.41  433.89      0
  4294967296    1073741824     float     sum      -1    17643  243.44  456.45      0    17728  242.28  454.27      0
  8589934592    2147483648     float     sum      -1    34428  249.51  467.83      0    34446  249.37  467.58      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 122.617 

@jvstme merged commit 004b91e into master May 7, 2025
25 checks passed
@jvstme deleted the issue_2590_nebius_clusters branch May 7, 2025 09:33