articles/aks/gpu-cluster.md: 19 additions & 18 deletions
@@ -13,7 +13,7 @@ ms.author: schaffererin
 
 # Use GPUs for compute-intensive workloads on Azure Kubernetes Service (AKS)
 
-Graphical processing units (GPUs) are often used for compute-intensive workloads, such as graphics and visualization workloads. AKS supports GPU-enabled Linux node pools to run compute-intensive Kubernetes workloads.
+Graphical processing units (GPUs) are often used for compute-intensive workloads, such as graphics and visualization workloads. AKS supports GPU-enabled Linux node pools to run compute-intensive Kubernetes workloads.
 
 This article helps you provision nodes with schedulable GPUs on new and existing AKS clusters.
 
@@ -92,7 +92,7 @@ To use the default OS SKU, you create the node pool without specifying an OS SKU
 * `--max-count`: Configures the cluster autoscaler to maintain a maximum of three nodes in the node pool.
 
 > [!NOTE]
-> Taints and VM sizes can only be set for node pools during node pool creation, but you can update autoscaler settings at any time.
+> Taints and VM sizes can only be set for node pools during node pool creation, but you can update autoscaler settings at any time.
 
 ##### [Azure Linux node pool](#tab/add-azure-linux-gpu-node-pool)
 
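The note above (taints and VM size are fixed at creation; autoscaler bounds are not) can be sketched as a pair of CLI calls. The resource group, cluster, pool name, and VM size below are placeholders, not values taken from this diff:

```shell
# Sketch: create a GPU node pool with a taint and autoscaler bounds.
# Resource names and VM size are placeholders -- substitute your own.
az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name gpunp \
    --node-count 1 \
    --node-vm-size Standard_NC6s_v3 \
    --node-taints sku=gpu:NoSchedule \
    --enable-cluster-autoscaler \
    --min-count 1 \
    --max-count 3

# Taints and VM size can't be changed later, but autoscaler bounds can:
az aks nodepool update \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name gpunp \
    --update-cluster-autoscaler \
    --min-count 1 \
    --max-count 5
```

Both commands require an authenticated Azure CLI session against an existing cluster, so they are illustrative rather than copy-paste ready.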
@@ -127,13 +127,13 @@ To use Azure Linux, you specify the OS SKU by setting `os-sku` to `AzureLinux` d
 
 ---
 
-2. Create a namespace using the [`kubectl create namespace`][kubectl-create] command.
+1. Create a namespace using the [`kubectl create namespace`][kubectl-create] command.
 
     ```bash
     kubectl create namespace gpu-resources
     ```
 
-3. Create a file named *nvidia-device-plugin-ds.yaml* and paste the following YAML manifest provided as part of the [NVIDIA device plugin for Kubernetes project][nvidia-github]:
+1. Create a file named *nvidia-device-plugin-ds.yaml* and paste the following YAML manifest provided as part of the [NVIDIA device plugin for Kubernetes project][nvidia-github]:
 
     ```yaml
     apiVersion: apps/v1
@@ -181,13 +181,13 @@ To use Azure Linux, you specify the OS SKU by setting `os-sku` to `AzureLinux` d
           path: /var/lib/kubelet/device-plugins
     ```
 
-4. Create the DaemonSet and confirm the NVIDIA device plugin is created successfully using the [`kubectl apply`][kubectl-apply] command.
+1. Create the DaemonSet and confirm the NVIDIA device plugin is created successfully using the [`kubectl apply`][kubectl-apply] command.
 
     ```bash
     kubectl apply -f nvidia-device-plugin-ds.yaml
    ```
 
-5. Now that you successfully installed the NVIDIA device plugin, you can check that your [GPUs are schedulable](#confirm-that-gpus-are-schedulable) and [run a GPU workload](#run-a-gpu-enabled-workload).
+1. Now that you successfully installed the NVIDIA device plugin, you can check that your [GPUs are schedulable](#confirm-that-gpus-are-schedulable) and [run a GPU workload](#run-a-gpu-enabled-workload).
 
 
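Before moving on, it can help to verify that the plugin's pods actually came up. A quick check, assuming the `gpu-resources` namespace from the steps above and that the manifest names the DaemonSet `nvidia-device-plugin-daemonset` (as NVIDIA's upstream example does):

```shell
# Confirm the device plugin DaemonSet exists and has ready pods.
# DaemonSet name is an assumption based on NVIDIA's upstream manifest.
kubectl get daemonset nvidia-device-plugin-daemonset --namespace gpu-resources

# One plugin pod should be running per GPU node.
kubectl get pods --namespace gpu-resources --output wide
```

The DESIRED and READY counts of the DaemonSet should match the number of GPU nodes in the pool.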
### Skip GPU driver installation
@@ -211,7 +211,9 @@ If you want to control the installation of the NVIDIA drivers or use the [NVIDIA
 
    Setting the `--gpu-driver` API field to `none` during node pool creation skips the automatic GPU driver installation. Any existing nodes aren't changed. You can scale the node pool to zero and then back up to make the change take effect.
 
-3. You can optionally install the NVIDIA GPU Operator following [these steps][nvidia-gpu-operator].
+   If you get the error `unrecognized arguments: --gpu-driver none`, [update the Azure CLI version](/cli/azure/update-azure-cli). For more information, see [Before you begin](#before-you-begin).
+
+1. You can optionally install the NVIDIA GPU Operator following [these steps][nvidia-gpu-operator].
 
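The skip-driver-install path described above can be sketched as follows. Resource names are placeholders, and `--gpu-driver none` requires a recent Azure CLI, as the updated text notes:

```shell
# Sketch: create a GPU node pool without automatic NVIDIA driver installation.
# Resource names are placeholders -- substitute your own.
az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name gpunp \
    --node-count 1 \
    --node-vm-size Standard_NC6s_v3 \
    --gpu-driver none

# For an existing pool, scale to zero and back up so the setting takes effect.
az aks nodepool scale --resource-group myResourceGroup \
    --cluster-name myAKSCluster --name gpunp --node-count 0
az aks nodepool scale --resource-group myResourceGroup \
    --cluster-name myAKSCluster --name gpunp --node-count 1
```

You would then manage drivers yourself, for example via the NVIDIA GPU Operator mentioned in the next step.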
## Confirm that GPUs are schedulable
@@ -230,7 +232,7 @@ After creating your cluster, confirm that GPUs are schedulable in Kubernetes.
    aks-gpunp-28993262-0   Ready    agent   13m   v1.20.7
    ```
 
-2. Confirm the GPUs are schedulable using the [`kubectl describe node`][kubectl-describe] command.
+1. Confirm the GPUs are schedulable using the [`kubectl describe node`][kubectl-describe] command.
 
    ```console
    kubectl describe node aks-gpunp-28993262-0
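To zero in on the GPU resource inside the long `kubectl describe node` output, you can filter for `nvidia.com/gpu` directly. The node name is the example one from this diff; substitute your own:

```shell
# The Capacity and Allocatable sections should list a non-zero nvidia.com/gpu.
kubectl describe node aks-gpunp-28993262-0 | grep "nvidia.com/gpu"

# Or query just the allocatable GPU count (note the escaped dots in the key):
kubectl get node aks-gpunp-28993262-0 \
    --output jsonpath='{.status.allocatable.nvidia\.com/gpu}'
```

If the value is `0` or missing, the device plugin isn't advertising GPUs on that node yet.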
@@ -289,7 +291,7 @@ To see the GPU in action, you can schedule a GPU-enabled workload with the appro
        effect: "NoSchedule"
    ```
 
-2. Run the job using the [`kubectl apply`][kubectl-apply] command, which parses the manifest file and creates the defined Kubernetes objects.
+1. Run the job using the [`kubectl apply`][kubectl-apply] command, which parses the manifest file and creates the defined Kubernetes objects.
 
    ```console
    kubectl apply -f samples-tf-mnist-demo.yaml
@@ -312,15 +314,15 @@ To see the GPU in action, you can schedule a GPU-enabled workload with the appro
    samples-tf-mnist-demo   1/1           3m10s      3m36s
    ```
 
-2. Exit the `kubectl --watch` process with *Ctrl-C*.
+1. Exit the `kubectl --watch` process with *Ctrl-C*.
 
-3. Get the name of the pod using the [`kubectl get pods`][kubectl-get] command.
+1. Get the name of the pod using the [`kubectl get pods`][kubectl-get] command.
 
    ```console
    kubectl get pods --selector app=samples-tf-mnist-demo
    ```
 
-4. View the output of the GPU-enabled workload using the [`kubectl logs`][kubectl-logs] command.
+1. View the output of the GPU-enabled workload using the [`kubectl logs`][kubectl-logs] command.
 
    ```console
    kubectl logs samples-tf-mnist-demo-smnr6
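Rather than copying the pod's generated suffix (`-smnr6` above) by hand, the pod name can be captured with a `jsonpath` selector. This is a convenience sketch, not part of the original article:

```shell
# Capture the demo pod's generated name, then stream its logs.
POD_NAME=$(kubectl get pods --selector app=samples-tf-mnist-demo \
    --output jsonpath='{.items[0].metadata.name}')
kubectl logs "$POD_NAME"
```

This assumes exactly one pod matches the `app=samples-tf-mnist-demo` label, which holds for the single-completion job used here.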
@@ -330,7 +332,7 @@ To see the GPU in action, you can schedule a GPU-enabled workload with the appro
 
    ```console
    2019-05-16 16:08:31.258328: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
-   2019-05-16 16:08:31.396846: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: 
+   2019-05-16 16:08:31.396846: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
    name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
    pciBusID: 2fd7:00:00.0
    totalMemory: 11.17GiB freeMemory: 11.10GiB
@@ -372,11 +374,11 @@ To see the GPU in action, you can schedule a GPU-enabled workload with the appro
 
 ## Clean up resources
 
-* Remove the associated Kubernetes objects you created in this article using the [`kubectl delete job`][kubectl delete] command.
+Remove the associated Kubernetes objects you created in this article using the [`kubectl delete job`][kubectl delete] command.
 
-    ```console
-    kubectl delete jobs samples-tf-mnist-demo
-    ```
+```console
+kubectl delete jobs samples-tf-mnist-demo
+```
 
 ## Next steps
@@ -423,4 +425,3 @@ To see the GPU in action, you can schedule a GPU-enabled workload with the appro