Add ARM64 support #2595

Merged: 6 commits, May 7, 2025
8 changes: 8 additions & 0 deletions docs/docs/concepts/dev-environments.md
Original file line number Diff line number Diff line change
@@ -175,6 +175,8 @@ name: vscode
ide: vscode

resources:
# 16 or more x86_64 cores
cpu: 16..
# 200GB or more RAM
memory: 200GB..
# 4 GPUs from 40GB to 80GB
@@ -187,10 +189,16 @@ resources:

</div>

The `cpu` property also allows you to specify the CPU architecture, `x86` or `arm`. Examples:
`x86:16` (16 x86-64 cores), `arm:8..` (at least 8 ARM64 cores).
If the architecture is not specified, `dstack` tries to infer it from the `gpu` specification
using `x86` as the fallback value.
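The `ARCH:COUNT` syntax described above can be illustrated with a small parser. This is a hypothetical sketch for clarity only; `parse_cpu_spec` is not part of dstack, whose real logic lives in `dstack._internal.core.models.resources.CPUSpec`:

```python
import re
from typing import Optional, Tuple


def parse_cpu_spec(spec: str) -> Tuple[Optional[str], str]:
    """Split a cpu spec like "x86:16" or "arm:8.." into (arch, count_range)."""
    parts = spec.split(":")
    if parts[0] in ("x86", "arm"):
        arch, rest = parts[0], parts[1:]
    else:
        # Architecture omitted: inferred from `gpu`, falling back to x86.
        arch, rest = None, parts
    count = rest[0] if rest else ""
    # A count is either a plain integer ("16") or an open/closed range ("8..").
    if not re.fullmatch(r"\d*\.\.\d*|\d+", count):
        raise ValueError(f"invalid cpu count range: {count!r}")
    return arch, count
```

For example, `parse_cpu_spec("arm:8..")` yields `("arm", "8..")`, matching the "at least 8 ARM64 cores" reading.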

The `gpu` property allows specifying not only memory size but also GPU vendor, names
and their quantity. Examples: `nvidia` (one NVIDIA GPU), `A100` (one A100), `A10G,A100` (either A10G or A100),
`A100:80GB` (one A100 of 80GB), `A100:2` (two A100), `24GB..40GB:2` (two GPUs between 24GB and 40GB),
`A100:40GB:2` (two A100 GPUs of 40GB).
If the vendor is not specified, `dstack` tries to infer it from the GPU name using `nvidia` as the fallback value.
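To make the examples above concrete, here is a hypothetical helper that decomposes a `gpu` spec string into its documented parts. It is not dstack's actual `GPUSpec` parser, and the vendor list is an assumption for illustration:

```python
import re


def describe_gpu_spec(spec: str) -> dict:
    """Classify each colon-separated token of a gpu spec string."""
    result = {"vendor": None, "names": [], "memory": None, "count": "1"}
    for token in spec.split(":"):
        if re.fullmatch(r"\d+", token):
            result["count"] = token  # e.g. "2" in "A100:2"
        elif re.fullmatch(r"(\d+GB)?\.\.(\d+GB)?|\d+GB", token):
            result["memory"] = token  # e.g. "80GB" or "24GB..40GB"
        elif token.lower() in ("nvidia", "amd", "tpu", "tenstorrent"):
            result["vendor"] = token.lower()
        else:
            result["names"] = token.split(",")  # e.g. "A10G,A100"
    return result
```

For instance, `describe_gpu_spec("A100:40GB:2")` reads as two A100 GPUs of 40GB each, consistent with the examples above.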

??? info "Google Cloud TPU"
To use TPUs, specify their architecture via the `gpu` property.
8 changes: 8 additions & 0 deletions docs/docs/concepts/services.md
@@ -316,6 +316,8 @@ commands:
port: 8000

resources:
# 16 or more x86_64 cores
cpu: 16..
# 2 GPUs of 80GB
gpu: 80GB:2

@@ -325,10 +327,16 @@ resources:

</div>

The `cpu` property also allows you to specify the CPU architecture, `x86` or `arm`. Examples:
`x86:16` (16 x86-64 cores), `arm:8..` (at least 8 ARM64 cores).
If the architecture is not specified, `dstack` tries to infer it from the `gpu` specification
using `x86` as the fallback value.

The `gpu` property allows specifying not only memory size but also GPU vendor, names
and their quantity. Examples: `nvidia` (one NVIDIA GPU), `A100` (one A100), `A10G,A100` (either A10G or A100),
`A100:80GB` (one A100 of 80GB), `A100:2` (two A100), `24GB..40GB:2` (two GPUs between 24GB and 40GB),
`A100:40GB:2` (two A100 GPUs of 40GB).
If the vendor is not specified, `dstack` tries to infer it from the GPU name using `nvidia` as the fallback value.

??? info "Google Cloud TPU"
To use TPUs, specify their architecture via the `gpu` property.
8 changes: 8 additions & 0 deletions docs/docs/concepts/tasks.md
@@ -192,6 +192,8 @@ commands:
- python fine-tuning/qlora/train.py

resources:
# 16 or more x86_64 cores
cpu: 16..
# 200GB or more RAM
memory: 200GB..
# 4 GPUs from 40GB to 80GB
@@ -204,10 +206,16 @@ resources:

</div>

The `cpu` property also allows you to specify the CPU architecture, `x86` or `arm`. Examples:
`x86:16` (16 x86-64 cores), `arm:8..` (at least 8 ARM64 cores).
If the architecture is not specified, `dstack` tries to infer it from the `gpu` specification
using `x86` as the fallback value.

The `gpu` property allows specifying not only memory size but also GPU vendor, names
and their quantity. Examples: `nvidia` (one NVIDIA GPU), `A100` (one A100), `A10G,A100` (either A10G or A100),
`A100:80GB` (one A100 of 80GB), `A100:2` (two A100), `24GB..40GB:2` (two GPUs between 24GB and 40GB),
`A100:40GB:2` (two A100 GPUs of 40GB).
If the vendor is not specified, `dstack` tries to infer it from the GPU name using `nvidia` as the fallback value.

??? info "Google Cloud TPU"
To use TPUs, specify their architecture via the `gpu` property.
11 changes: 11 additions & 0 deletions docs/docs/reference/api/python/index.md
@@ -136,10 +136,21 @@ finally:
show_root_toc_entry: false
heading_level: 4
item_id_mapping:
cpu: dstack.api.CPU
gpu: dstack.api.GPU
memory: dstack.api.Memory
Range: dstack.api.Range

### `dstack.api.CPU` { #dstack.api.CPU data-toc-label="CPU" }

#SCHEMA# dstack.api.CPU
overrides:
show_root_heading: false
show_root_toc_entry: false
heading_level: 4
item_id_mapping:
Range: dstack.api.Range

### `dstack.api.GPU` { #dstack.api.GPU data-toc-label="GPU" }

#SCHEMA# dstack.api.GPU
8 changes: 8 additions & 0 deletions docs/docs/reference/dstack.yml/dev-environment.md
@@ -35,6 +35,14 @@ The `dev-environment` configuration type allows running [dev environments](../..
required: true
item_id_prefix: resources-

#### `resources.cpu` { #resources-cpu data-toc-label="cpu" }

#SCHEMA# dstack._internal.core.models.resources.CPUSpec
overrides:
show_root_heading: false
type:
required: true

#### `resources.gpu` { #resources-gpu data-toc-label="gpu" }

#SCHEMA# dstack._internal.core.models.resources.GPUSpec
12 changes: 10 additions & 2 deletions docs/docs/reference/dstack.yml/fleet.md
@@ -46,15 +46,23 @@ The `fleet` configuration type allows creating and updating fleets.
required: true
item_id_prefix: resources-

#### `resouces.gpu` { #resources-gpu data-toc-label="gpu" }
#### `resources.cpu` { #resources-cpu data-toc-label="cpu" }

#SCHEMA# dstack._internal.core.models.resources.CPUSpec
overrides:
show_root_heading: false
type:
required: true

#### `resources.gpu` { #resources-gpu data-toc-label="gpu" }

#SCHEMA# dstack._internal.core.models.resources.GPUSpec
overrides:
show_root_heading: false
type:
required: true

#### `resouces.disk` { #resources-disk data-toc-label="disk" }
#### `resources.disk` { #resources-disk data-toc-label="disk" }

#SCHEMA# dstack._internal.core.models.resources.DiskSpec
overrides:
12 changes: 10 additions & 2 deletions docs/docs/reference/dstack.yml/service.md
@@ -129,15 +129,23 @@ The `service` configuration type allows running [services](../../concepts/servic
required: true
item_id_prefix: resources-

#### `resouces.gpu` { #resources-gpu data-toc-label="gpu" }
#### `resources.cpu` { #resources-cpu data-toc-label="cpu" }

#SCHEMA# dstack._internal.core.models.resources.CPUSpec
overrides:
show_root_heading: false
type:
required: true

#### `resources.gpu` { #resources-gpu data-toc-label="gpu" }

#SCHEMA# dstack._internal.core.models.resources.GPUSpec
overrides:
show_root_heading: false
type:
required: true

#### `resouces.disk` { #resources-disk data-toc-label="disk" }
#### `resources.disk` { #resources-disk data-toc-label="disk" }

#SCHEMA# dstack._internal.core.models.resources.DiskSpec
overrides:
12 changes: 10 additions & 2 deletions docs/docs/reference/dstack.yml/task.md
@@ -35,15 +35,23 @@ The `task` configuration type allows running [tasks](../../concepts/tasks.md).
required: true
item_id_prefix: resources-

#### `resouces.gpu` { #resources-gpu data-toc-label="gpu" }
#### `resources.cpu` { #resources-cpu data-toc-label="cpu" }

#SCHEMA# dstack._internal.core.models.resources.CPUSpec
overrides:
show_root_heading: false
type:
required: true

#### `resources.gpu` { #resources-gpu data-toc-label="gpu" }

#SCHEMA# dstack._internal.core.models.resources.GPUSpec
overrides:
show_root_heading: false
type:
required: true

#### `resouces.disk` { #resources-disk data-toc-label="disk" }
#### `resources.disk` { #resources-disk data-toc-label="disk" }

#SCHEMA# dstack._internal.core.models.resources.DiskSpec
overrides:
7 changes: 5 additions & 2 deletions docs/docs/reference/environment-variables.md
@@ -117,8 +117,11 @@ For more details on the options below, refer to the [server deployment](../guide
* `DSTACK_SERVER_MAX_OFFERS_TRIED` - Sets how many instance offers to try when starting a job.
Setting a high value can degrade server performance.
* `DSTACK_RUNNER_VERSION` – Sets exact runner version for debug. Defaults to `latest`. Ignored if `DSTACK_RUNNER_DOWNLOAD_URL` is set.
* `DSTACK_RUNNER_DOWNLOAD_URL` – Overrides `dstack-runner` binary download URL.
* `DSTACK_SHIM_DOWNLOAD_URL` – Overrides `dstack-shim` binary download URL.
* `DSTACK_RUNNER_DOWNLOAD_URL` – Overrides `dstack-runner` binary download URL. The URL can contain `{version}` and/or `{arch}` placeholders,
where `{version}` is the `dstack` version in the `X.Y.Z` format or `latest`, and `{arch}` is either `amd64` or `arm64`, for example,
`https://dstack.example.com/{arch}/{version}/dstack-runner`.
* `DSTACK_SHIM_DOWNLOAD_URL` – Overrides `dstack-shim` binary download URL. The URL can contain `{version}` and/or `{arch}` placeholders,
see `DSTACK_RUNNER_DOWNLOAD_URL` for details.

Contributor: We also need to update

  1. runner/README.md
  2. runner/.just (currently it only builds/uploads one arch)

Collaborator (Author): I'd rather open another PR for justfile — I have some ideas for improvements.
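The placeholder substitution described for `DSTACK_RUNNER_DOWNLOAD_URL` can be sketched as follows. This is illustrative only; the exact expansion the server performs may differ:

```python
def expand_download_url(template: str, version: str, arch: str) -> str:
    """Fill the {version}/{arch} placeholders in a runner/shim download URL."""
    assert arch in ("amd64", "arm64"), "only amd64 and arm64 are supported"
    return template.replace("{version}", version).replace("{arch}", arch)


url = expand_download_url(
    "https://dstack.example.com/{arch}/{version}/dstack-runner",
    version="latest",
    arch="arm64",
)
# url == "https://dstack.example.com/arm64/latest/dstack-runner"
```

Because `str.replace` leaves the template untouched when a placeholder is absent, a URL may use `{version}`, `{arch}`, both, or neither.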
* `DSTACK_DEFAULT_CREDS_DISABLED` – Disables default credentials detection if set. Defaults to `None`.
* `DSTACK_LOCAL_BACKEND_ENABLED` – Enables local backend for debug if set. Defaults to `None`.

4 changes: 2 additions & 2 deletions src/dstack/_internal/cli/services/args.py
@@ -19,8 +19,8 @@ def port_mapping(v: str) -> PortMapping:
return PortMapping.parse(v)


def cpu_spec(v: str) -> resources.Range[int]:
return parse_obj_as(resources.Range[int], v)
def cpu_spec(v: str) -> dict:
return resources.CPUSpec.parse(v)


def memory_spec(v: str) -> resources.Range[resources.Memory]:
40 changes: 38 additions & 2 deletions src/dstack/_internal/cli/services/configurators/run.py
@@ -6,9 +6,10 @@
from typing import Dict, List, Optional, Set, Tuple

import gpuhunt
from pydantic import parse_obj_as

import dstack._internal.core.models.resources as resources
from dstack._internal.cli.services.args import disk_spec, gpu_spec, port_mapping
from dstack._internal.cli.services.args import cpu_spec, disk_spec, gpu_spec, port_mapping
from dstack._internal.cli.services.configurators.base import (
ApplyEnvVarsConfiguratorMixin,
BaseApplyConfigurator,
@@ -39,6 +40,7 @@
TaskConfiguration,
)
from dstack._internal.core.models.repos.base import Repo
from dstack._internal.core.models.resources import CPUSpec
from dstack._internal.core.models.runs import JobSubmission, JobTerminationReason, RunStatus
from dstack._internal.core.services.configs import ConfigManager
from dstack._internal.core.services.diff import diff_models
@@ -72,6 +74,7 @@ def apply_configuration(
):
self.apply_args(conf, configurator_args, unknown_args)
self.validate_gpu_vendor_and_image(conf)
self.validate_cpu_arch_and_image(conf)
if repo is None:
repo = self.api.repos.load(Path.cwd())
config_manager = ConfigManager()
@@ -289,6 +292,14 @@ def register_args(cls, parser: argparse.ArgumentParser, default_max_offers: int
default=default_max_offers,
)
cls.register_env_args(configuration_group)
configuration_group.add_argument(
"--cpu",
type=cpu_spec,
help="Request CPU for the run. "
"The format is [code]ARCH[/]:[code]COUNT[/] (all parts are optional)",
dest="cpu_spec",
metavar="SPEC",
)
configuration_group.add_argument(
"--gpu",
type=gpu_spec,
@@ -310,6 +321,8 @@ def apply_args(self, conf: BaseRunConfiguration, args: argparse.Namespace, unkno
apply_profile_args(args, conf)
if args.run_name:
conf.name = args.run_name
if args.cpu_spec:
conf.resources.cpu = resources.CPUSpec.parse_obj(args.cpu_spec)
if args.gpu_spec:
conf.resources.gpu = resources.GPUSpec.parse_obj(args.gpu_spec)
if args.disk_spec:
@@ -342,7 +355,7 @@ def interpolate_env(self, conf: BaseRunConfiguration):

def validate_gpu_vendor_and_image(self, conf: BaseRunConfiguration) -> None:
"""
Infers `resources.gpu.vendor` if not set, requires `image` if the vendor is AMD.
Infers and sets `resources.gpu.vendor` if not set, requires `image` if the vendor is AMD.
"""
gpu_spec = conf.resources.gpu
if gpu_spec is None:
@@ -400,6 +413,29 @@ def validate_gpu_vendor_and_image(self, conf: BaseRunConfiguration) -> None:
"`image` is required if `resources.gpu.vendor` is `tenstorrent`"
)

def validate_cpu_arch_and_image(self, conf: BaseRunConfiguration) -> None:
"""
Infers `resources.cpu.arch` if not set, requires `image` if the architecture is ARM.
"""
# TODO: Remove in 0.20. Use conf.resources.cpu directly
cpu_spec = parse_obj_as(CPUSpec, conf.resources.cpu)
arch = cpu_spec.arch
if arch is None:
gpu_spec = conf.resources.gpu
if (
gpu_spec is not None
and gpu_spec.vendor in [None, gpuhunt.AcceleratorVendor.NVIDIA]
and gpu_spec.name
and any(map(gpuhunt.is_nvidia_superchip, gpu_spec.name))
):
arch = gpuhunt.CPUArchitecture.ARM
else:
arch = gpuhunt.CPUArchitecture.X86
# NOTE: We don't set the inferred resources.cpu.arch for compatibility with older servers.
# Servers with ARM support set the arch using the same logic.
if arch == gpuhunt.CPUArchitecture.ARM and conf.image is None:
raise ConfigurationError("`image` is required if `resources.cpu.arch` is `arm`")


class RunWithPortsConfigurator(BaseRunConfigurator):
@classmethod