ServerlessLLM currently supports model quantization using `bitsandbytes` through the Hugging Face Transformers' `BitsAndBytesConfig`.
Available precisions include:
- `int8`
- `fp4`
- `nf4`
For further information, consult the [Hugging Face documentation for bitsandbytes](https://huggingface.co/docs/transformers/main/en/quantization/bitsandbytes).
> Note: Quantization support is currently experimental; you may encounter issues, especially on multi-GPU machines.
### Usage
To use quantization, create a `BitsAndBytesConfig` object with your desired settings:
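As a minimal sketch, the config can be built with the standard Hugging Face Transformers API (the model name below is purely illustrative, and the exact way ServerlessLLM consumes this config may differ from a plain `from_pretrained` call):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization. For int8, use load_in_8bit=True instead;
# for FP4, set bnb_4bit_quant_type="fp4".
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Illustrative model name; the config is passed through the standard
# Transformers quantization_config argument.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",
    quantization_config=quant_config,
)
```

With `load_in_4bit=True`, `bnb_4bit_quant_type` selects between `fp4` and `nf4`; setting `load_in_8bit=True` instead selects `int8`.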