Commit baf2a3c

Merge pull request #1320 from MicrosoftDocs/main
Merged by Learn.Build PR Management system
2 parents 61a6ece + 81e700d commit baf2a3c


articles/aks/gpu-cluster.md

Lines changed: 19 additions & 18 deletions
@@ -13,7 +13,7 @@ ms.author: schaffererin
# Use GPUs for compute-intensive workloads on Azure Kubernetes Service (AKS)

Graphical processing units (GPUs) are often used for compute-intensive workloads, such as graphics and visualization workloads. AKS supports GPU-enabled Linux node pools to run compute-intensive Kubernetes workloads.

This article helps you provision nodes with schedulable GPUs on new and existing AKS clusters.
@@ -92,7 +92,7 @@ To use the default OS SKU, you create the node pool without specifying an OS SKU
* `--max-count`: Configures the cluster autoscaler to maintain a maximum of three nodes in the node pool.

> [!NOTE]
> Taints and VM sizes can only be set for node pools during node pool creation, but you can update autoscaler settings at any time.
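The flags described above come together in a single node-pool creation call. The following is a sketch only, not the article's verbatim command: the resource group, cluster name, and VM size are hypothetical placeholders, and the command is printed rather than executed so it can be reviewed before running against a real cluster.

```shell
# Sketch of the GPU node pool creation this section describes.
# myResourceGroup, myAKSCluster, and Standard_NC6s_v3 are placeholders.
cmd='az aks nodepool add
  --resource-group myResourceGroup
  --cluster-name myAKSCluster
  --name gpunp
  --node-count 1
  --node-vm-size Standard_NC6s_v3
  --node-taints sku=gpu:NoSchedule
  --enable-cluster-autoscaler
  --min-count 1
  --max-count 3'
# Print the assembled command; running it requires an existing AKS cluster.
echo "$cmd"
```

The taint keeps ordinary pods off the expensive GPU nodes; only workloads that tolerate `sku=gpu:NoSchedule` are scheduled there.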
##### [Azure Linux node pool](#tab/add-azure-linux-gpu-node-pool)
@@ -127,13 +127,13 @@ To use Azure Linux, you specify the OS SKU by setting `os-sku` to `AzureLinux` d
---

-2. Create a namespace using the [`kubectl create namespace`][kubectl-create] command.
+1. Create a namespace using the [`kubectl create namespace`][kubectl-create] command.

```bash
kubectl create namespace gpu-resources
```

-3. Create a file named *nvidia-device-plugin-ds.yaml* and paste the following YAML manifest provided as part of the [NVIDIA device plugin for Kubernetes project][nvidia-github]:
+1. Create a file named *nvidia-device-plugin-ds.yaml* and paste the following YAML manifest provided as part of the [NVIDIA device plugin for Kubernetes project][nvidia-github]:

```yaml
apiVersion: apps/v1
@@ -181,13 +181,13 @@ To use Azure Linux, you specify the OS SKU by setting `os-sku` to `AzureLinux` d
          path: /var/lib/kubelet/device-plugins
```

-4. Create the DaemonSet and confirm the NVIDIA device plugin is created successfully using the [`kubectl apply`][kubectl-apply] command.
+1. Create the DaemonSet and confirm the NVIDIA device plugin is created successfully using the [`kubectl apply`][kubectl-apply] command.

```bash
kubectl apply -f nvidia-device-plugin-ds.yaml
```

-5. Now that you successfully installed the NVIDIA device plugin, you can check that your [GPUs are schedulable](#confirm-that-gpus-are-schedulable) and [run a GPU workload](#run-a-gpu-enabled-workload).
+1. Now that you successfully installed the NVIDIA device plugin, you can check that your [GPUs are schedulable](#confirm-that-gpus-are-schedulable) and [run a GPU workload](#run-a-gpu-enabled-workload).
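One way to sanity-check the `kubectl apply` step, not covered by the article itself, is to count the device-plugin pods that reach the `Running` state. The sketch below works on a sample table standing in for live `kubectl get pods -n gpu-resources` output, so the parsing logic can be seen without a cluster.

```shell
# Count Running NVIDIA device plugin pods. The sample table below is an
# illustrative stand-in for `kubectl get pods -n gpu-resources` output.
sample='NAME                                   READY   STATUS    RESTARTS   AGE
nvidia-device-plugin-daemonset-x7k2p   1/1     Running   0          2m'
running=$(printf '%s\n' "$sample" | grep -c 'nvidia-device-plugin.*Running')
echo "running device plugin pods: $running"
```

On a live cluster, a count of zero after a few minutes usually points at a scheduling or taint-toleration problem on the GPU nodes.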
### Skip GPU driver installation
@@ -211,7 +211,9 @@ If you want to control the installation of the NVIDIA drivers or use the [NVIDIA
Setting the `--gpu-driver` API field to `none` during node pool creation skips the automatic GPU driver installation. Any existing nodes aren't changed. You can scale the node pool to zero and then back up to make the change take effect.

-3. You can optionally install the NVIDIA GPU Operator following [these steps][nvidia-gpu-operator].
+If you get the error `unrecognized arguments: --gpu-driver none`, [update the Azure CLI version](/cli/azure/update-azure-cli). For more information, see [Before you begin](#before-you-begin).
+
+1. You can optionally install the NVIDIA GPU Operator following [these steps][nvidia-gpu-operator].
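As a sketch of the `--gpu-driver` field described above (the resource names are hypothetical placeholders, and the command is printed rather than executed), the creation call differs from a standard GPU node pool by only one flag:

```shell
# Hypothetical names; the only difference from a standard GPU node pool
# creation is --gpu-driver none, which skips automatic driver installation.
cmd='az aks nodepool add
  --resource-group myResourceGroup
  --cluster-name myAKSCluster
  --name gpunp
  --node-count 1
  --node-vm-size Standard_NC6s_v3
  --gpu-driver none'
# Print only; running requires an AKS cluster and a recent Azure CLI.
echo "$cmd"
```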
## Confirm that GPUs are schedulable
@@ -230,7 +232,7 @@ After creating your cluster, confirm that GPUs are schedulable in Kubernetes.
aks-gpunp-28993262-0   Ready    agent   13m   v1.20.7
```

-2. Confirm the GPUs are schedulable using the [`kubectl describe node`][kubectl-describe] command.
+1. Confirm the GPUs are schedulable using the [`kubectl describe node`][kubectl-describe] command.

```console
kubectl describe node aks-gpunp-28993262-0
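In the describe output, the GPU appears under `Capacity` as `nvidia.com/gpu`. A small sketch of pulling that count out programmatically; the `Capacity` lines below are illustrative stand-ins, not live output:

```shell
# Extract the schedulable GPU count from sample `kubectl describe node`
# output. The Capacity block below is an illustrative stand-in.
sample='Capacity:
  cpu:                6
  nvidia.com/gpu:     1'
gpu_count=$(printf '%s\n' "$sample" | awk '/nvidia.com\/gpu:/ {print $2}')
echo "schedulable GPUs: $gpu_count"
```

A count of `0` (or a missing `nvidia.com/gpu` line) on a live node means the device plugin hasn't advertised the GPU yet.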
@@ -289,7 +291,7 @@ To see the GPU in action, you can schedule a GPU-enabled workload with the appro
        effect: "NoSchedule"
```

-2. Run the job using the [`kubectl apply`][kubectl-apply] command, which parses the manifest file and creates the defined Kubernetes objects.
+1. Run the job using the [`kubectl apply`][kubectl-apply] command, which parses the manifest file and creates the defined Kubernetes objects.

```console
kubectl apply -f samples-tf-mnist-demo.yaml
@@ -312,15 +314,15 @@ To see the GPU in action, you can schedule a GPU-enabled workload with the appro
samples-tf-mnist-demo   1/1           3m10s      3m36s
```

-2. Exit the `kubectl --watch` process with *Ctrl-C*.
+1. Exit the `kubectl --watch` process with *Ctrl-C*.

-3. Get the name of the pod using the [`kubectl get pods`][kubectl-get] command.
+1. Get the name of the pod using the [`kubectl get pods`][kubectl-get] command.

```console
kubectl get pods --selector app=samples-tf-mnist-demo
```
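Because the pod name carries a random suffix, it's often captured in a script rather than copied by hand. A sketch of grabbing it from the table output (the sample table is illustrative, not live output):

```shell
# Pull the pod name (second line, first column) from a sample of
# `kubectl get pods --selector app=samples-tf-mnist-demo` output.
sample='NAME                          READY   STATUS      RESTARTS   AGE
samples-tf-mnist-demo-smnr6   0/1     Completed   0          3m'
pod=$(printf '%s\n' "$sample" | awk 'NR==2 {print $1}')
echo "$pod"
```

Against a live cluster, `kubectl get pods --selector app=samples-tf-mnist-demo -o jsonpath='{.items[0].metadata.name}'` returns the same name without parsing the table.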
-4. View the output of the GPU-enabled workload using the [`kubectl logs`][kubectl-logs] command.
+1. View the output of the GPU-enabled workload using the [`kubectl logs`][kubectl-logs] command.

```console
kubectl logs samples-tf-mnist-demo-smnr6
@@ -330,7 +332,7 @@ To see the GPU in action, you can schedule a GPU-enabled workload with the appro
```console
2019-05-16 16:08:31.258328: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-05-16 16:08:31.396846: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 2fd7:00:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
@@ -372,11 +374,11 @@ To see the GPU in action, you can schedule a GPU-enabled workload with the appro
## Clean up resources

-* Remove the associated Kubernetes objects you created in this article using the [`kubectl delete job`][kubectl delete] command.
+Remove the associated Kubernetes objects you created in this article using the [`kubectl delete job`][kubectl delete] command.

```console
kubectl delete jobs samples-tf-mnist-demo
```
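If the GPU node pool itself is no longer needed, it can be removed as well. This step is not part of the original article; the sketch below reuses the same hypothetical resource names and prints the command instead of executing it.

```shell
# Hypothetical names; deleting the node pool also removes its (billed) nodes.
cmd='az aks nodepool delete
  --resource-group myResourceGroup
  --cluster-name myAKSCluster
  --name gpunp'
# Print only; run against a real cluster to actually delete the pool.
echo "$cmd"
```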
## Next steps

@@ -423,4 +425,3 @@ To see the GPU in action, you can schedule a GPU-enabled workload with the appro
[az-extension-add]: /cli/azure/extension#az-extension-add
[az-extension-update]: /cli/azure/extension#az-extension-update
[NVadsA10]: /azure/virtual-machines/nva10v5-series