
Commit f19fe14

Document Sync by Tina
1 parent 0e6edfa commit f19fe14

File tree

1 file changed (+49 -1 lines)


docs/stable/store/quickstart.md

Lines changed: 49 additions & 1 deletion
@@ -179,4 +179,52 @@ for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

## Quantization

ServerlessLLM currently supports model quantization using `bitsandbytes` through Hugging Face Transformers' `BitsAndBytesConfig`.

Available precisions include:
- `int8`
- `fp4`
- `nf4`

For further information, consult the [Hugging Face documentation for bitsandbytes](https://huggingface.co/docs/transformers/main/en/quantization/bitsandbytes).

> Note: Quantization is currently experimental, especially on multi-GPU machines. You may encounter issues when using this feature in multi-GPU environments.
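
If you do hit problems on a multi-GPU machine, a common workaround (generic CUDA behaviour, not a ServerlessLLM-specific API) is to pin the process to a single GPU before anything initializes CUDA, for example:

```python
import os

# Make only the first GPU visible to this process; this must run before
# torch (or any other CUDA-using library) initializes the GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```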

### Usage

To use quantization, create a `BitsAndBytesConfig` object with your desired settings:

```python
from transformers import BitsAndBytesConfig
import torch

# load_model is provided by ServerlessLLM Store's Transformers integration
# (import path assumed):
from sllm_store.transformers import load_model

# Pick one of the following configurations.

# For 8-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True
)

# For 4-bit quantization (NF4)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4"
)

# For 4-bit quantization (FP4)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4"
)

# Then load your model with the config
model = load_model(
    "facebook/opt-1.3b",
    device_map="auto",
    torch_dtype=torch.float16,
    storage_path="./models/",
    fully_parallel=True,
    quantization_config=quantization_config,
)
```
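
Once loaded, the quantized model should behave like a regular Hugging Face causal language model. A minimal inference sketch (assuming a CUDA device is available and that `load_model` returns a standard Transformers model, as above):

```python
from transformers import AutoTokenizer

# The tokenizer is fetched from the Hugging Face Hub as usual; only the
# model weights are served from the ServerlessLLM store.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```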
