Update Axolotl Examples #2502

Merged 2 commits on Apr 17, 2025
2 changes: 1 addition & 1 deletion docs/examples.md
@@ -62,7 +62,7 @@ hide:
</h3>

<p>
Fine-tune Llama 3 on a custom dataset using Axolotl.
Fine-tune Llama 4 on a custom dataset using Axolotl.
</p>
</a>

24 changes: 18 additions & 6 deletions examples/accelerators/amd/README.md
@@ -161,13 +161,18 @@ To request multiple GPUs, specify the quantity after the GPU name, separated by

```yaml
type: task
# The name is optional, if not specified, generated randomly
name: axolotl-amd-llama31-train

# Using RunPod's ROCm Docker image
image: runpod/pytorch:2.1.2-py3.10-rocm6.0.2-ubuntu22.04
# Required environment variables
env:
- HF_TOKEN
- WANDB_API_KEY
- WANDB_PROJECT

Contributor

How is WANDB_API_KEY not enough?

Collaborator Author

No, we need to set WANDB_PROJECT and WANDB_NAME.

The difference is that on our current master it is set in the config file, while in this PR we pass it as an argument. When we pass it as an argument, we do not need to include the config.yaml in our repo.
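
For illustration, a minimal sketch of the two approaches (file names and values here are assumptions, not the exact contents of this repo):

```yaml
# Option A (current master): W&B settings live inside the Axolotl config file
# committed to the repo, e.g. examples/fine-tuning/axolotl/config.yaml
wandb_project: my-project               # hardcoded in the config
wandb_name: axolotl-amd-llama31-train

# Option B (this PR): no config.yaml in the repo; the values come from the
# task's environment variables and are passed on the command line
# commands:
#   - accelerate launch -m axolotl.cli.train -- axolotl/examples/llama-3/fft-8b.yaml
#     --wandb-project "$WANDB_PROJECT"
#     --wandb-name "$WANDB_NAME"
```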

Contributor

Okay, then at least let's hardcode the value of WANDB_NAME, e.g. to axolotl-amd-llama31-train. If the user wants, they can change it.

Contributor

BTW, this is another use case where we could set it to $DSTACK_RUN_NAME.
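
A minimal sketch of that suggestion, assuming `DSTACK_RUN_NAME` is exposed to the task at runtime (worth double-checking against the dstack docs):

```yaml
env:
  - HF_TOKEN
  - WANDB_API_KEY
  - WANDB_PROJECT
commands:
  # Reuse the dstack run name as the W&B run name instead of hardcoding it
  - export WANDB_NAME=$DSTACK_RUN_NAME
  # ...rest of the training commands, passing --wandb-name "$WANDB_NAME"
```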

- WANDB_NAME=axolotl-amd-llama31-train
- HUB_MODEL_ID
# Commands of the task
commands:
- export PATH=/opt/conda/envs/py_3.10/bin:$PATH
@@ -177,6 +182,9 @@ To request multiple GPUs, specify the quantity after the GPU name, separated by
- cd axolotl
- git checkout d4f6c65

Contributor

Why this particular revision??

Collaborator Author

xformers is incompatible with ROCm. Axolotl recommends applying the workaround described in this link.

In revision d4f6c65, the workaround is already implemented. This is also how ROCm builds its Axolotl image. link

Contributor

Then a comment is needed, I suppose.

Collaborator Author

Yes. I will update it accordingly.

- pip install -e .
# Latest pynvml is not compatible with axolotl commit d4f6c65, so we need to fall back to version 11.5.3
- pip uninstall pynvml -y
- pip install pynvml==11.5.3

Contributor

Should we add a note or at least a comment on it?

Collaborator Author

Yes. I will add.

- cd ..
- wget https://dstack-binaries.s3.amazonaws.com/flash_attn-2.0.4-cp310-cp310-linux_x86_64.whl
- pip install flash_attn-2.0.4-cp310-cp310-linux_x86_64.whl
@@ -190,18 +198,18 @@ To request multiple GPUs, specify the quantity after the GPU name, separated by
- make
- pip install .
- cd ..
- accelerate launch -m axolotl.cli.train axolotl/examples/llama-3/fft-8b.yaml

# Uncomment to leverage spot instances
#spot_policy: auto
- accelerate launch -m axolotl.cli.train -- axolotl/examples/llama-3/fft-8b.yaml
--wandb-project "$WANDB_PROJECT"
--wandb-name "$WANDB_NAME"
--hub-model-id "$HUB_MODEL_ID"

resources:
gpu: MI300X
disk: 150GB
```
</div>

Note, to support ROCm, we need to checkout to commit `d4f6c65`. You can find the installation instruction in [rocm-blogs :material-arrow-top-right-thin:{ .external }](https://github.com/ROCm/rocm-blogs/blob/release/blogs/artificial-intelligence/axolotl/src/Dockerfile.rocm){:target="_blank"}.
Note that to support ROCm, we need to check out commit `d4f6c65`. This commit eliminates the need to manually modify the Axolotl source code to make xformers compatible with ROCm, as described in the [xformers workaround :material-arrow-top-right-thin:{ .external }](https://docs.axolotl.ai/docs/amd_hpc.html#apply-xformers-workaround). The same installation approach is used to build the Axolotl ROCm Docker image [(see Dockerfile) :material-arrow-top-right-thin:{ .external }](https://github.com/ROCm/rocm-blogs/blob/release/blogs/artificial-intelligence/axolotl/src/Dockerfile.rocm){:target="_blank"}.

> To speed up installation of `flash-attention` and `xformers`, we use pre-built binaries uploaded to S3.
> You can find the tasks that build and upload the binaries
@@ -216,6 +224,10 @@ cloud resources and run the configuration.

```shell
$ HF_TOKEN=...
$ WANDB_API_KEY=...
$ WANDB_PROJECT=...
$ WANDB_NAME=axolotl-amd-llama31-train
$ HUB_MODEL_ID=...
$ dstack apply -f examples/deployment/vllm/amd/.dstack.yml
```

18 changes: 10 additions & 8 deletions examples/fine-tuning/axolotl/.dstack.yaml
@@ -1,23 +1,25 @@
type: task
# The name is optional, if not specified, generated randomly
name: axolotl-train
name: axolotl-nvidia-llama-scout-train

# Using the official Axolotl's Docker image
image: winglian/axolotl-cloud:main-20240429-py3.11-cu121-2.2.1
image: axolotlai/axolotl:main-latest

# Required environment variables
env:
- HF_TOKEN
- WANDB_API_KEY
- WANDB_PROJECT
- WANDB_NAME=axolotl-nvidia-llama-scout-train
- HUB_MODEL_ID
# Commands of the task
commands:
- accelerate launch -m axolotl.cli.train examples/fine-tuning/axolotl/config.yaml
- wget https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/examples/llama-4/scout-qlora-fsdp1.yaml
- axolotl train scout-qlora-fsdp1.yaml --wandb-project $WANDB_PROJECT --wandb-name $WANDB_NAME --hub-model-id $HUB_MODEL_ID

resources:
gpu:
# 24GB or more vRAM
memory: 24GB..
# Two or more GPU (required by FSDP)
count: 2..
# Two GPUs (required by FSDP)
gpu: H100:2
# Shared memory size for inter-process communication
shm_size: 24GB
disk: 500GB..
40 changes: 22 additions & 18 deletions examples/fine-tuning/axolotl/README.md
@@ -1,7 +1,7 @@
# Axolotl

This example shows how to use [Axolotl :material-arrow-top-right-thin:{ .external }](https://github.com/OpenAccess-AI-Collective/axolotl){:target="_blank"}
with `dstack` to fine-tune Llama3 8B using FSDP and QLoRA.
with `dstack` to fine-tune 4-bit Quantized [Llama-4-Scout-17B-16E :material-arrow-top-right-thin:{ .external }](https://huggingface.co/axolotl-quants/Llama-4-Scout-17B-16E-Linearized-bnb-nf4-bf16){:target="_blank"} using FSDP and QLoRA.

??? info "Prerequisites"
Once `dstack` is [installed](https://dstack.ai/docs/installation), go ahead clone the repo, and run `dstack init`.
@@ -18,44 +18,45 @@ with `dstack` to fine-tune Llama3 8B using FSDP and QLoRA.

## Training configuration recipe

Axolotl reads the model, LoRA, and dataset arguments, as well as trainer configuration from a YAML file. This file can
be found at [`examples/fine-tuning/axolotl/config.yaml` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/fine-tuning/axolotl/config.yaml){:target="_blank"}.
You can modify it as needed.
Axolotl reads the model, QLoRA, and dataset arguments, as well as the trainer configuration, from a [`scout-qlora-fsdp1.yaml` :material-arrow-top-right-thin:{ .external }](https://github.com/axolotl-ai-cloud/axolotl/blob/main/examples/llama-4/scout-qlora-fsdp1.yaml){:target="_blank"} file. The configuration uses the 4-bit Axolotl-quantized version of `meta-llama/Llama-4-Scout-17B-16E`, requiring only ~43GB of VRAM per GPU with a 4K context length.
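
For orientation, a rough sketch of the kind of fields such an Axolotl config contains; the actual `scout-qlora-fsdp1.yaml` in the Axolotl repo is authoritative, and the values below are illustrative assumptions only:

```yaml
# Illustrative sketch, not the real scout-qlora-fsdp1.yaml
base_model: axolotl-quants/Llama-4-Scout-17B-16E-Linearized-bnb-nf4-bf16  # pre-quantized base model
load_in_4bit: true              # QLoRA keeps base weights in 4-bit NF4
adapter: qlora                  # train low-rank adapters instead of full weights
lora_r: 32
lora_alpha: 64
sequence_len: 4096              # ~4K context, matching the ~43GB VRAM/GPU figure
micro_batch_size: 1
gradient_accumulation_steps: 4
num_epochs: 1
learning_rate: 2e-4
datasets:
  - path: tatsu-lab/alpaca      # placeholder dataset for illustration
    type: alpaca
fsdp:
  - full_shard                  # shard parameters and optimizer state across the GPUs
  - auto_wrap
```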

> Before you proceed with training, make sure to update the `hub_model_id` in [`examples/fine-tuning/axolotl/config.yaml` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/fine-tuning/alignment-handbook/config.yaml){:target="_blank"}
> with your HuggingFace username.

## Single-node training

The easiest way to run a training script with `dstack` is by creating a task configuration file.
This file can be found at [`examples/fine-tuning/axolotl/train.dstack.yml` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/fine-tuning/axolotl/train.dstack.yml){:target="_blank"}.
This file can be found at [`examples/fine-tuning/axolotl/.dstack.yml` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/fine-tuning/axolotl/.dstack.yaml){:target="_blank"}.

<div editor-title="examples/fine-tuning/axolotl/.dstack.yml">

```yaml
type: task
name: axolotl-train
# The name is optional, if not specified, generated randomly
name: axolotl-nvidia-llama-scout-train

# Using the official Axolotl's Docker image
image: winglian/axolotl-cloud:main-20240429-py3.11-cu121-2.2.1
image: axolotlai/axolotl:main-latest

# Required environment variables
env:
- HF_TOKEN
- WANDB_API_KEY
- WANDB_PROJECT
- WANDB_NAME=axolotl-nvidia-llama-scout-train
- HUB_MODEL_ID
# Commands of the task
commands:
- accelerate launch -m axolotl.cli.train examples/fine-tuning/axolotl/config.yaml

# Uncomment to leverage spot instances
#spot_policy: auto
- wget https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/examples/llama-4/scout-qlora-fsdp1.yaml
- axolotl train scout-qlora-fsdp1.yaml
--wandb-project $WANDB_PROJECT
--wandb-name $WANDB_NAME
--hub-model-id $HUB_MODEL_ID

resources:
gpu:
# 24GB or more vRAM
memory: 24GB..
# Two or more GPU
count: 2..
# Two GPUs (required by FSDP)
gpu: H100:2
# Shared memory size for inter-process communication
shm_size: 24GB
disk: 500GB..
```

</div>
@@ -75,6 +76,9 @@ cloud resources and run the configuration.
```shell
$ HF_TOKEN=...
$ WANDB_API_KEY=...
$ WANDB_PROJECT=...
$ WANDB_NAME=axolotl-nvidia-llama-scout-train
$ HUB_MODEL_ID=...
$ dstack apply -f examples/fine-tuning/axolotl/.dstack.yml
```

14 changes: 11 additions & 3 deletions examples/fine-tuning/axolotl/amd/.dstack.yml
@@ -1,12 +1,14 @@
type: task
# The name is optional, if not specified, generated randomly
name: axolotl-amd-llama31-train

image: runpod/pytorch:2.1.2-py3.10-rocm6.0.2-ubuntu22.04

# Required environment variables
env:
- HF_TOKEN
- WANDB_API_KEY
- WANDB_PROJECT
- WANDB_NAME=axolotl-amd-llama31-train
- HUB_MODEL_ID
# Commands of the task
commands:
- export PATH=/opt/conda/envs/py_3.10/bin:$PATH
@@ -16,6 +18,9 @@ commands:
- cd axolotl
- git checkout d4f6c65
- pip install -e .
# Latest pynvml is not compatible with axolotl commit d4f6c65, so we need to fall back to version 11.5.3
- pip uninstall pynvml -y
- pip install pynvml==11.5.3
- cd ..
- wget https://dstack-binaries.s3.amazonaws.com/flash_attn-2.0.4-cp310-cp310-linux_x86_64.whl
- pip install flash_attn-2.0.4-cp310-cp310-linux_x86_64.whl
@@ -29,7 +34,10 @@ commands:
- make
- pip install .
- cd ..
- accelerate launch -m axolotl.cli.train axolotl/examples/llama-3/fft-8b.yaml
- accelerate launch -m axolotl.cli.train -- axolotl/examples/llama-3/fft-8b.yaml
--wandb-project "$WANDB_PROJECT"
--wandb-name "$WANDB_NAME"
--hub-model-id "$HUB_MODEL_ID"

resources:
gpu: MI300X