Use AS_COMPACT collocation for gcp placement groups #2587

Merged
r4victor merged 3 commits into master from pr_gcp_placement_group_distance
Apr 30, 2025

Conversation

r4victor
Collaborator

Closes #2586

Tested with n1-highmem-2, a3-highgpu-8g, and a3-megagpu-8g. Results are better for a3-highgpu-8g and a3-megagpu-8g. n1-highmem-2 still works and is as slow as before, so there is no noticeable regression.

NCCL tests on a3-highgpu-8g:

#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
     1048576         16384     float    none      -1    627.4    1.67    1.57      0    614.6    1.71    1.60      0
     2097152         32768     float    none      -1    657.1    3.19    2.99      0    641.2    3.27    3.07      0
     4194304         65536     float    none      -1    669.6    6.26    5.87      0    688.5    6.09    5.71      0
     8388608        131072     float    none      -1    811.5   10.34    9.69      0    729.4   11.50   10.78      0
    16777216        262144     float    none      -1    964.5   17.39   16.31      0    945.5   17.74   16.64      0
    33554432        524288     float    none      -1   1100.9   30.48   28.57      0   1080.8   31.05   29.11      0
    67108864       1048576     float    none      -1   1375.0   48.81   45.76      0   1345.9   49.86   46.75      0
   134217728       2097152     float    none      -1   2418.1   55.51   52.04      0   2434.0   55.14   51.70      0
   268435456       4194304     float    none      -1   4871.5   55.10   51.66      0   4857.8   55.26   51.81      0
   536870912       8388608     float    none      -1   9944.5   53.99   50.61      0   9952.2   53.95   50.57      0
  1073741824      16777216     float    none      -1    19626   54.71   51.29      0    19534   54.97   51.53      0
  2147483648      33554432     float    none      -1    37732   56.91   53.36      0    37323   57.54   53.94      0
  4294967296      67108864     float    none      -1    73500   58.43   54.78      0    73449   58.48   54.82      0
  8589934592     134217728     float    none      -1   147172   58.37   54.72      0   146112   58.79   55.12      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 34.3695 
#

NCCL tests on a3-megagpu-8g:

#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
     8388608        131072     float    none      -1    147.4   56.91   53.35    N/A    144.8   57.94   54.32    N/A
    16777216        262144     float    none      -1    177.6   94.45   88.55    N/A    174.4   96.23   90.21    N/A
    33554432        524288     float    none      -1    258.8  129.68  121.57    N/A    258.0  130.04  121.91    N/A
    67108864       1048576     float    none      -1    433.1  154.96  145.27    N/A    429.1  156.40  146.63    N/A
   134217728       2097152     float    none      -1    794.8  168.87  158.32    N/A    787.5  170.43  159.78    N/A
   268435456       4194304     float    none      -1   1508.3  177.97  166.85    N/A   1497.0  179.31  168.10    N/A
   536870912       8388608     float    none      -1   2852.5  188.21  176.45    N/A   2837.8  189.19  177.36    N/A
  1073741824      16777216     float    none      -1   5502.5  195.14  182.94    N/A   5496.2  195.36  183.15    N/A
  2147483648      33554432     float    none      -1    10814  198.59  186.18    N/A    10800  198.85  186.42    N/A
  4294967296      67108864     float    none      -1    21433  200.39  187.86    N/A    21410  200.61  188.07    N/A
  8589934592     134217728     float    none      -1    42639  201.46  188.87    N/A    42622  201.54  188.94    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 150.96 
#
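For anyone reading the tables above, here is a minimal sketch of how the algbw and busbw columns relate to size and time. It assumes 16 ranks and a bus-bandwidth factor of (n - 1) / n; both are inferred from the busbw/algbw ratio in the output (about 0.9375), not stated in this PR, and the helper names are made up for illustration.

```python
# Sketch: reproduce the algbw/busbw columns of the NCCL tables from size and time.
# ASSUMPTION: 16 ranks and a (n - 1) / n bus-bandwidth factor, inferred from the
# busbw/algbw ratio in the tables; neither is stated in the PR.

N_RANKS = 16  # assumed, e.g. 2 nodes x 8 GPUs


def algbw_gbps(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s (1 GB = 1e9 bytes): bytes moved / elapsed time."""
    return size_bytes / (time_us * 1e-6) / 1e9


def busbw_gbps(algbw: float, n_ranks: int = N_RANKS) -> float:
    """Bus bandwidth in GB/s, scaling algbw by the assumed (n - 1) / n factor."""
    return algbw * (n_ranks - 1) / n_ranks


# First out-of-place row of each table: (instance type, size in bytes, time in us).
ROWS = [
    ("a3-highgpu-8g", 1048576, 627.4),
    ("a3-megagpu-8g", 8388608, 147.4),
]

for name, size, time_us in ROWS:
    a = algbw_gbps(size, time_us)
    print(f"{name}: algbw={a:.2f} GB/s busbw={busbw_gbps(a):.2f} GB/s")
```

The reported "Avg bus bandwidth" matches the plain mean of every busbw value across both the out-of-place and in-place columns. The a3-highgpu-8g run starts at 1 MiB while the a3-megagpu-8g run starts at 8 MiB, so the two averages are not directly comparable; the large-message rows are the better comparison.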

r4victor merged commit fd0c144 into master on Apr 30, 2025
25 checks passed
r4victor deleted the pr_gcp_placement_group_distance branch on April 30, 2025 11:32
un-def added a commit that referenced this pull request May 1, 2025
un-def added a commit that referenced this pull request May 1, 2025
Development

Successfully merging this pull request may close these issues.

Specify --max-distance for GCP placement policies