Support Nebius InfiniBand clusters #2604


Merged: 3 commits into `master` from `issue_2590_nebius_clusters` on May 7, 2025
Conversation

@jvstme (Collaborator) commented May 6, 2025

An InfiniBand cluster is now created automatically when provisioning 8xH100 or 8xH200 instances using a fleet configuration with `placement: cluster`.

Support for Nebius clusters is implemented using `dstack` placement groups. Several changes were made to the placement group management logic:

- The offer for the master instance of the fleet is passed to `Compute.create_placement_group`, which allows setting different placement group settings based on the offer. Nebius requires different settings for H100 and H200 clusters.
- `Compute.is_suitable_placement_group` is introduced to allow choosing an appropriate placement group when creating the master instance and filtering offers for non-master instances based on backend-specific placement group properties. Nebius currently only provides homogeneous clusters, so offers need to be filtered based on the placement group.
- The placement group object is passed to `Compute.create_instance` so the instance can be added to the placement group using its backend-specific properties, such as the cluster ID on Nebius.
- The placement group name is generated at master instance provisioning time, not at fleet creation time. This makes it possible to have different placement group names within the same fleet and avoids name conflicts, since multiple placement groups can be created while `dstack` is trying different offers for the master instance.
- Placement groups that were created during master instance provisioning but didn't end up being used are now cleaned up. Nebius quotas limit the number of clusters, so unused clusters need to be cleaned up promptly, without waiting for fleet deletion.
- If all offers fail for the master instance, `dstack` no longer attempts to provision the other fleet instances, to avoid them being provisioned without a placement group or without connectivity at all.
- Placement group creation errors are now handled gracefully, so that `dstack` can move on to other master instance offers, which may lead to creating different placement groups. For example, if `dstack` cannot create a cluster in one Nebius region because of a missing quota, it may attempt to create a cluster in another region.
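The interface changes above can be sketched roughly as follows. Only the `Compute` method names come from this PR; the data classes, fields, and the `fabric` values are illustrative stand-ins, not dstack's actual code:

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative stand-ins for dstack's internal models.
@dataclass
class Offer:
    region: str
    gpu: str  # e.g. "H100" or "H200"

@dataclass
class PlacementGroup:
    name: str
    region: str
    # Backend-specific data, e.g. a Nebius cluster ID (hypothetical shape).
    backend_data: dict = field(default_factory=dict)

class NebiusCompute:
    def create_placement_group(self, name: str, master_offer: Offer) -> PlacementGroup:
        # The master instance's offer is now available here, so settings
        # can differ between H100 and H200 clusters (values are made up).
        fabric = "fabric-h100" if master_offer.gpu == "H100" else "fabric-h200"
        return PlacementGroup(
            name=name,
            region=master_offer.region,
            backend_data={"cluster_id": f"cluster-{name}", "fabric": fabric},
        )

    def is_suitable_placement_group(self, pg: PlacementGroup, offer: Offer) -> bool:
        # Nebius clusters are homogeneous, so offers for non-master
        # instances are filtered against the existing group's properties.
        return offer.region == pg.region

    def create_instance(self, offer: Offer, placement_group: Optional[PlacementGroup]) -> str:
        # The full placement group object is passed in, so backend-specific
        # properties such as the Nebius cluster ID can be used here.
        cluster_id = placement_group.backend_data["cluster_id"] if placement_group else None
        return f"instance in {offer.region}, cluster={cluster_id}"
```

Passing the whole `PlacementGroup` object (rather than just its name) is what lets each backend attach instances using its own notion of a group, such as a cluster ID on Nebius.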

#2590
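Taken together, the master-instance flow described above could look roughly like this sketch (function names, the error type, and the naming scheme are hypothetical): generate a fresh placement group name per offer attempt, tolerate group creation failures, and promptly delete groups that end up unused.

```python
import uuid

class PlacementGroupError(Exception):
    """Hypothetical error raised when a backend cannot create a placement group."""

def provision_master(compute, offers, fleet_name):
    """Try offers in order; return (instance, placement_group) or (None, None)."""
    created = []   # all placement groups created during the attempts
    chosen = None
    instance = None
    for offer in offers:
        # A unique name per attempt avoids conflicts when several groups
        # are created while trying different offers for the same fleet.
        name = f"{fleet_name}-{uuid.uuid4().hex[:8]}"
        try:
            pg = compute.create_placement_group(name, offer)
        except PlacementGroupError:
            continue  # e.g. missing quota in this region; try the next offer
        created.append(pg)
        try:
            instance = compute.create_instance(offer, pg)
            chosen = pg
            break
        except Exception:
            continue
    # Clean up groups that didn't end up being used: Nebius quotas limit
    # the number of clusters, so this must not wait for fleet deletion.
    for pg in created:
        if pg is not chosen:
            compute.delete_placement_group(pg)
    # If every offer failed, the caller should skip the other fleet
    # instances, since they would have no placement group at all.
    return instance, chosen
```

Breaking out of the loop only after both the group and the instance are created is what makes a failed group creation recoverable: the loop simply moves on to the next offer, which may target another region.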

@jvstme (Collaborator, Author) commented May 6, 2025

Will mark as ready for review after more testing and adding unit tests

@jvstme marked this pull request as ready for review May 7, 2025 07:56
@jvstme requested a review from r4victor May 7, 2025 07:57
@jvstme (Collaborator, Author) commented May 7, 2025

NCCL tests
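The output below matches what nccl-tests' `all_reduce_perf` prints for these parameters; an invocation along these lines could produce it (hostnames, rank layout, and the MPI setup are illustrative, not taken from the PR):

```shell
# Illustrative: run all_reduce_perf from nccl-tests across the two 8xH100
# nodes, one MPI rank per GPU, matching the header below
# (-b minBytes, -e maxBytes, -f step factor, -g GPUs per rank, -w warmup iters, -n iters).
mpirun -np 16 -H node1:8,node2:8 \
    ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 -w 5 -n 20
```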
# nThread 1 nGpus 1 minBytes 8 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid     92 on computeinstance-e00fgx6ezb5j59wpmn device  0 [0000:8d:00] NVIDIA H100 80GB HBM3
#  Rank  1 Group  0 Pid     93 on computeinstance-e00fgx6ezb5j59wpmn device  1 [0000:91:00] NVIDIA H100 80GB HBM3
#  Rank  2 Group  0 Pid     94 on computeinstance-e00fgx6ezb5j59wpmn device  2 [0000:95:00] NVIDIA H100 80GB HBM3
#  Rank  3 Group  0 Pid     95 on computeinstance-e00fgx6ezb5j59wpmn device  3 [0000:99:00] NVIDIA H100 80GB HBM3
#  Rank  4 Group  0 Pid     96 on computeinstance-e00fgx6ezb5j59wpmn device  4 [0000:ab:00] NVIDIA H100 80GB HBM3
#  Rank  5 Group  0 Pid     98 on computeinstance-e00fgx6ezb5j59wpmn device  5 [0000:af:00] NVIDIA H100 80GB HBM3
#  Rank  6 Group  0 Pid    100 on computeinstance-e00fgx6ezb5j59wpmn device  6 [0000:b3:00] NVIDIA H100 80GB HBM3
#  Rank  7 Group  0 Pid    103 on computeinstance-e00fgx6ezb5j59wpmn device  7 [0000:b7:00] NVIDIA H100 80GB HBM3
#  Rank  8 Group  0 Pid    105 on computeinstance-e00r463y4p6my1q40j device  0 [0000:8d:00] NVIDIA H100 80GB HBM3
#  Rank  9 Group  0 Pid    106 on computeinstance-e00r463y4p6my1q40j device  1 [0000:91:00] NVIDIA H100 80GB HBM3
#  Rank 10 Group  0 Pid    107 on computeinstance-e00r463y4p6my1q40j device  2 [0000:95:00] NVIDIA H100 80GB HBM3
#  Rank 11 Group  0 Pid    108 on computeinstance-e00r463y4p6my1q40j device  3 [0000:99:00] NVIDIA H100 80GB HBM3
#  Rank 12 Group  0 Pid    109 on computeinstance-e00r463y4p6my1q40j device  4 [0000:ab:00] NVIDIA H100 80GB HBM3
#  Rank 13 Group  0 Pid    111 on computeinstance-e00r463y4p6my1q40j device  5 [0000:af:00] NVIDIA H100 80GB HBM3
#  Rank 14 Group  0 Pid    113 on computeinstance-e00r463y4p6my1q40j device  6 [0000:b3:00] NVIDIA H100 80GB HBM3
#  Rank 15 Group  0 Pid    116 on computeinstance-e00r463y4p6my1q40j device  7 [0000:b7:00] NVIDIA H100 80GB HBM3

#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1    75.33    0.00    0.00      0    34.21    0.00    0.00      0
          16             4     float     sum      -1    29.39    0.00    0.00      0    29.42    0.00    0.00      0
          32             8     float     sum      -1    29.08    0.00    0.00      0    29.73    0.00    0.00      0
          64            16     float     sum      -1    29.22    0.00    0.00      0    29.13    0.00    0.00      0
         128            32     float     sum      -1    30.11    0.00    0.01      0    29.67    0.00    0.01      0
         256            64     float     sum      -1    63.19    0.00    0.01      0    30.50    0.01    0.02      0
         512           128     float     sum      -1    33.33    0.02    0.03      0    31.14    0.02    0.03      0
        1024           256     float     sum      -1    34.77    0.03    0.06      0    31.98    0.03    0.06      0
        2048           512     float     sum      -1    33.63    0.06    0.11      0    33.40    0.06    0.11      0
        4096          1024     float     sum      -1    35.86    0.11    0.21      0    35.14    0.12    0.22      0
        8192          2048     float     sum      -1    36.81    0.22    0.42      0    35.66    0.23    0.43      0
       16384          4096     float     sum      -1    37.63    0.44    0.82      0    36.00    0.46    0.85      0
       32768          8192     float     sum      -1    50.15    0.65    1.23      0    35.86    0.91    1.71      0
       65536         16384     float     sum      -1    38.27    1.71    3.21      0    36.63    1.79    3.35      0
      131072         32768     float     sum      -1    43.76    2.99    5.62      0    41.31    3.17    5.95      0
      262144         65536     float     sum      -1    52.07    5.03    9.44      0    57.45    4.56    8.56      0
      524288        131072     float     sum      -1    124.2    4.22    7.92      0    82.78    6.33   11.88      0
     1048576        262144     float     sum      -1    82.74   12.67   23.76      0    82.82   12.66   23.74      0
     2097152        524288     float     sum      -1    96.63   21.70   40.69      0    98.05   21.39   40.10      0
     4194304       1048576     float     sum      -1    116.3   36.06   67.62      0    117.9   35.59   66.73      0
     8388608       2097152     float     sum      -1    159.5   52.60   98.62      0    174.8   48.00   90.00      0
    16777216       4194304     float     sum      -1    210.5   79.71  149.46      0    205.1   81.79  153.35      0
    33554432       8388608     float     sum      -1    299.8  111.92  209.85      0    281.9  119.01  223.15      0
    67108864      16777216     float     sum      -1    472.0  142.19  266.61      0    475.8  141.04  264.44      0
   134217728      33554432     float     sum      -1    764.5  175.56  329.18      0    749.6  179.06  335.74      0
   268435456      67108864     float     sum      -1   1293.1  207.60  389.24      0   1287.2  208.55  391.03      0
   536870912     134217728     float     sum      -1   2537.9  211.54  396.65      0   2537.6  211.57  396.68      0
  1073741824     268435456     float     sum      -1   4636.2  231.60  434.25      0   4640.9  231.37  433.81      0
  2147483648     536870912     float     sum      -1   9250.7  232.14  435.27      0   9280.0  231.41  433.89      0
  4294967296    1073741824     float     sum      -1    17643  243.44  456.45      0    17728  242.28  454.27      0
  8589934592    2147483648     float     sum      -1    34428  249.51  467.83      0    34446  249.37  467.58      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 122.617 

@jvstme merged commit 004b91e into master May 7, 2025
25 checks passed
@jvstme deleted the issue_2590_nebius_clusters branch May 7, 2025 09:33