Commit 2d7eb7d (1 parent: 99a88d3)

Authored by Bihan and Bihan Rana

Update TGI Example with Llama 4 Scout (#2529)

* Update TGI Example with Llama 4 Scout
* Update examples.md

Co-authored-by: Bihan Rana <[email protected]>

File tree: 3 files changed (+41 −31 lines)

docs/examples.md — 1 addition, 1 deletion

```diff
@@ -38,7 +38,7 @@ hide:
         TGI
       </h3>
       <p>
-        Deploy Llama 3.1 with TGI
+        Deploy Llama 4 with TGI
       </p>
     </a>
     <a href="/examples/deployment/nim"
```

examples/deployment/tgi/.dstack.yml — 15 additions, 9 deletions

```diff
@@ -1,17 +1,24 @@
 type: service
-name: llama31
+name: llama4-scout
 
 image: ghcr.io/huggingface/text-generation-inference:latest
+
 env:
   - HF_TOKEN
-  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
-  - MAX_INPUT_LENGTH=4000
-  - MAX_TOTAL_TOKENS=4096
+  - MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct
+  - MAX_INPUT_LENGTH=8192
+  - MAX_TOTAL_TOKENS=16384
+  # max_batch_prefill_tokens must be >= max_input_tokens
+  - MAX_BATCH_PREFILL_TOKENS=8192
 commands:
-  - NUM_SHARD=$DSTACK_GPUS_NUM text-generation-launcher
+  # Activate the virtual environment at /usr/src/.venv/
+  # as required by TGI's latest image.
+  - . /usr/src/.venv/bin/activate
+  - NUM_SHARD=$DSTACK_GPUS_NUM text-generation-launcher
+
 port: 80
 # Register the model
-model: meta-llama/Meta-Llama-3.1-8B-Instruct
+model: meta-llama/Llama-4-Scout-17B-16E-Instruct
 
 # Uncomment to leverage spot instances
 #spot_policy: auto
@@ -21,6 +28,5 @@ model: meta-llama/Meta-Llama-3.1-8B-Instruct
 # - /data:/data
 
 resources:
-  gpu: 24GB
-  # Uncomment if using multiple GPUs
-  #shm_size: 24GB
+  gpu: H200:2
+  disk: 500GB..
```
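For reference, applying the hunks above yields the following full service configuration (reconstructed from the diff; indentation assumed):

```yaml
type: service
name: llama4-scout

image: ghcr.io/huggingface/text-generation-inference:latest

env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct
  - MAX_INPUT_LENGTH=8192
  - MAX_TOTAL_TOKENS=16384
  # max_batch_prefill_tokens must be >= max_input_tokens
  - MAX_BATCH_PREFILL_TOKENS=8192
commands:
  # Activate the virtual environment at /usr/src/.venv/
  # as required by TGI's latest image.
  - . /usr/src/.venv/bin/activate
  - NUM_SHARD=$DSTACK_GPUS_NUM text-generation-launcher

port: 80
# Register the model
model: meta-llama/Llama-4-Scout-17B-16E-Instruct

# Uncomment to leverage spot instances
#spot_policy: auto

# Uncomment to cache downloaded models
#volumes:
#  - /data:/data

resources:
  gpu: H200:2
  disk: 500GB..
```

Note that `500GB..` is dstack's open-ended range syntax (500 GB or more), and `gpu: H200:2` requests two H200 GPUs.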

examples/deployment/tgi/README.md — 25 additions, 21 deletions

````diff
@@ -1,11 +1,11 @@
 ---
 title: HuggingFace TGI
-description: "This example shows how to deploy Llama 3.1 to any cloud or on-premises environment using HuggingFace TGI and dstack."
+description: "This example shows how to deploy Llama 4 Scout to any cloud or on-premises environment using HuggingFace TGI and dstack."
 ---
 
 # HuggingFace TGI
 
-This example shows how to deploy Llama 3.1 8B with `dstack` using [HuggingFace TGI :material-arrow-top-right-thin:{ .external }](https://huggingface.co/docs/text-generation-inference/en/index){:target="_blank"}.
+This example shows how to deploy Llama 4 Scout with `dstack` using [HuggingFace TGI :material-arrow-top-right-thin:{ .external }](https://huggingface.co/docs/text-generation-inference/en/index){:target="_blank"}.
 
 ??? info "Prerequisites"
     Once `dstack` is [installed](https://dstack.ai/docs/installation), go ahead clone the repo, and run `dstack init`.
@@ -22,37 +22,43 @@ This example shows how to deploy Llama 3.1 8B with `dstack` using [HuggingFace T
 
 ## Deployment
 
-Here's an example of a service that deploys Llama 3.1 8B using TGI.
+Here's an example of a service that deploys [`Llama-4-Scout-17B-16E-Instruct` :material-arrow-top-right-thin:{ .external }](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct){:target="_blank"} using TGI.
 
 <div editor-title="examples/deployment/tgi/.dstack.yml">
 
 ```yaml
 type: service
-name: llama31
+name: llama4-scout
 
 image: ghcr.io/huggingface/text-generation-inference:latest
+
 env:
   - HF_TOKEN
-  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
-  - MAX_INPUT_LENGTH=4000
-  - MAX_TOTAL_TOKENS=4096
+  - MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct
+  - MAX_INPUT_LENGTH=8192
+  - MAX_TOTAL_TOKENS=16384
+  # max_batch_prefill_tokens must be >= max_input_tokens
+  - MAX_BATCH_PREFILL_TOKENS=8192
 commands:
-  - NUM_SHARD=$DSTACK_GPUS_NUM text-generation-launcher
+  # Activate the virtual environment at /usr/src/.venv/
+  # as required by TGI's latest image.
+  - . /usr/src/.venv/bin/activate
+  - NUM_SHARD=$DSTACK_GPUS_NUM text-generation-launcher
+
 port: 80
 # Register the model
-model: meta-llama/Meta-Llama-3.1-8B-Instruct
+model: meta-llama/Llama-4-Scout-17B-16E-Instruct
 
 # Uncomment to leverage spot instances
 #spot_policy: auto
 
-# Uncomment to cache downloaded models
+# Uncomment to cache downloaded models
 #volumes:
 # - /data:/data
 
 resources:
-  gpu: 24GB
-  # Uncomment if using multiple GPUs
-  #shm_size: 24GB
+  gpu: H200:2
+  disk: 500GB..
 ```
 </div>
@@ -66,12 +72,11 @@ To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/referenc
 $ HF_TOKEN=...
 $ dstack apply -f examples/deployment/tgi/.dstack.yml
 
- # BACKEND     REGION        RESOURCES                       SPOT  PRICE
- 1 tensordock  unitedstates  2xCPU, 10GB, 1xRTX3090 (24GB)   no    $0.231
- 2 tensordock  unitedstates  2xCPU, 10GB, 1xRTX3090 (24GB)   no    $0.242
- 3 tensordock  india         2xCPU, 38GB, 1xA5000 (24GB)     no    $0.283
+ # BACKEND  REGION      RESOURCES                       SPOT  PRICE
+ 1 vastai   is-iceland  48xCPU, 128GB, 2xH200 (140GB)   no    $7.87
+ 2 runpod   EU-SE-1     40xCPU, 128GB, 2xH200 (140GB)   no    $7.98
 
-Submit a new run? [y/n]: y
+Submit the run llama4-scout? [y/n]: y
 
 Provisioning...
 ---> 100%
@@ -89,7 +94,7 @@ $ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
   -H 'Authorization: Bearer <dstack token>' \
   -H 'Content-Type: application/json' \
   -d '{
-    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+    "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
     "messages": [
       {
         "role": "system",
@@ -117,7 +122,6 @@ The source-code of this example can be found in
 ## What's next?
 
 1. Check [services](https://dstack.ai/docs/services)
-2. Browse the [Llama 3.1](https://dstack.ai/examples/llms/llama31/), [vLLM](https://dstack.ai/examples/deployment/vllm/),
-   and [NIM](https://dstack.ai/examples/deployment/nim/) examples
+2. Browse the [Llama](https://dstack.ai/examples/llms/llama/), [vLLM](https://dstack.ai/examples/deployment/vllm/), [SgLang](https://dstack.ai/examples/deployment/sglang/) and [NIM](https://dstack.ai/examples/deployment/nim/) examples
 3. See also [AMD](https://dstack.ai/examples/accelerators/amd/) and
    [TPU](https://dstack.ai/examples/accelerators/tpu/)
````
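The README's curl example can equally be issued from Python. A minimal sketch that builds the same request for TGI's OpenAI-compatible chat endpoint; the proxy URL and token are the README's placeholders, and `build_chat_request` is a hypothetical helper, not part of dstack or TGI:

```python
import json

# Placeholder values copied from the README's curl example; replace
# them with your own dstack server address and access token.
DSTACK_PROXY_URL = "http://127.0.0.1:3000/proxy/models/main/chat/completions"


def build_chat_request(token: str, user_message: str) -> tuple[dict, str]:
    """Build headers and a JSON body matching the README's curl call."""
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
    }
    return headers, json.dumps(payload)


headers, body = build_chat_request("<dstack token>", "What is Llama 4 Scout?")
print(json.loads(body)["model"])
```

Send the request with any HTTP client, e.g. `requests.post(DSTACK_PROXY_URL, headers=headers, data=body)`.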
