
Commit bf64b32

Authored by ariG23498, sayakpaul, pcuenca, and stevhliu
[Guide] Quantize your Diffusion Models with bnb (#10012)
* chore: initial draft
* Apply suggestions from code review (Co-authored-by: Pedro Cuenca <[email protected]>, Steven Liu <[email protected]>)
* chore: link in place
* chore: review suggestions
* Apply suggestions from code review (Co-authored-by: Steven Liu <[email protected]>)
* chore: review suggestions
* Update docs/source/en/quantization/bitsandbytes.md (Co-authored-by: Steven Liu <[email protected]>)
* review suggestions
* chore: review suggestions
* Apply suggestions from code review (Co-authored-by: Steven Liu <[email protected]>)
* adding same changes to 4 bit section
* review suggestions

Co-authored-by: Sayak Paul <[email protected]>
Co-authored-by: Pedro Cuenca <[email protected]>
Co-authored-by: Steven Liu <[email protected]>
1 parent 3335e22 commit bf64b32

File tree: 1 file changed, +205 -49 lines changed

docs/source/en/quantization/bitsandbytes.md (205 additions, 49 deletions)
@@ -17,6 +17,12 @@ specific language governing permissions and limitations under the License.
4-bit quantization compresses a model even further, and it is commonly used with [QLoRA](https://hf.co/papers/2305.14314) to finetune quantized LLMs.

This guide demonstrates how quantization can enable running [FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev) on less than 16GB of VRAM and even on a free Google Colab instance.

![comparison image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/quant-bnb/comparison.png)

To use bitsandbytes, make sure you have the following libraries installed:

@@ -31,70 +37,167 @@ Now you can quantize a model by passing a [`BitsAndBytesConfig`] to [`~ModelMixi
Quantizing a model in 8-bit halves the memory usage.

bitsandbytes is supported in both Transformers and Diffusers, so you can quantize both the [`FluxTransformer2DModel`] and [`~transformers.T5EncoderModel`].

For Ada and higher-series GPUs, we recommend changing `torch_dtype` to `torch.bfloat16`.

> [!TIP]
> The [`CLIPTextModel`] and [`AutoencoderKL`] aren't quantized because they're already small in size and because [`AutoencoderKL`] only has a few `torch.nn.Linear` layers.

```py
import torch

from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig

from diffusers import FluxTransformer2DModel
from transformers import T5EncoderModel

quant_config = TransformersBitsAndBytesConfig(load_in_8bit=True)

text_encoder_2_8bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)

transformer_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)
```
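
As a quick sanity check, here is a small sketch that prints the memory footprint of the quantized components. It assumes your installed diffusers version exposes `get_memory_footprint` on the model (mirroring the transformers method of the same name); if it doesn't, use the text encoder line only.

```py
# Rough memory footprint of each quantized component, in GB.
print(f"text_encoder_2: {text_encoder_2_8bit.get_memory_footprint() / 1024**3:.2f} GB")
# Assumes the diffusers model exposes the same helper as transformers models.
print(f"transformer:    {transformer_8bit.get_memory_footprint() / 1024**3:.2f} GB")
```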

By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter.

```diff
transformer_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
+   torch_dtype=torch.float32,
)
```

Let's generate an image using our quantized models.

Setting `device_map="auto"` automatically fills all available space on the GPU(s) first, then the CPU, and finally, the hard drive (the absolute slowest option) if there is still not enough memory.

```py
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer_8bit,
    text_encoder_2=text_encoder_2_8bit,
    torch_dtype=torch.float16,
    device_map="auto",
)

pipe_kwargs = {
    "prompt": "A cat holding a sign that says hello world",
    "height": 1024,
    "width": 1024,
    "guidance_scale": 3.5,
    "num_inference_steps": 50,
    "max_sequence_length": 512,
}

image = pipe(**pipe_kwargs, generator=torch.manual_seed(0)).images[0]
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/quant-bnb/8bit.png"/>
</div>

When there is enough memory, you can also directly move the pipeline to the GPU with `.to("cuda")` and apply [`~DiffusionPipeline.enable_model_cpu_offload`] to optimize GPU memory usage.

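A minimal sketch of that alternative follows, assuming `pipe` was created without `device_map="auto"` (device mapping and manual placement shouldn't be combined):

```py
# Either keep the whole pipeline on the GPU if it fits in VRAM...
# pipe.to("cuda")

# ...or offload idle submodules to the CPU between forward passes to lower
# peak VRAM usage at a small speed cost.
pipe.enable_model_cpu_offload()

image = pipe(**pipe_kwargs, generator=torch.manual_seed(0)).images[0]
```
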
Once a model is quantized, you can push the model to the Hub with the [`~ModelMixin.push_to_hub`] method. The quantization `config.json` file is pushed first, followed by the quantized model weights. You can also save the serialized 8-bit models locally with [`~ModelMixin.save_pretrained`].
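
For example, a minimal sketch of serializing the quantized transformer (the repository and directory names below are hypothetical):

```py
# Save the quantized weights and their quantization config locally...
transformer_8bit.save_pretrained("flux.1-dev-transformer-8bit")

# ...or upload them to the Hub under your own namespace.
transformer_8bit.push_to_hub("<your-username>/flux.1-dev-transformer-8bit")
```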

</hfoption>
<hfoption id="4-bit">

Quantizing a model in 4-bit reduces your memory usage by 4x.

bitsandbytes is supported in both Transformers and Diffusers, so you can quantize both the [`FluxTransformer2DModel`] and [`~transformers.T5EncoderModel`].

For Ada and higher-series GPUs, we recommend changing `torch_dtype` to `torch.bfloat16`.

> [!TIP]
> The [`CLIPTextModel`] and [`AutoencoderKL`] aren't quantized because they're already small in size and because [`AutoencoderKL`] only has a few `torch.nn.Linear` layers.

```py
import torch

from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig

from diffusers import FluxTransformer2DModel
from transformers import T5EncoderModel

quant_config = TransformersBitsAndBytesConfig(load_in_4bit=True)

text_encoder_2_4bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(load_in_4bit=True)

transformer_4bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)
```

By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter.

```diff
transformer_4bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
+   torch_dtype=torch.float32,
)
```

Let's generate an image using our quantized models.

Setting `device_map="auto"` automatically fills all available space on the GPU(s) first, then the CPU, and finally, the hard drive (the absolute slowest option) if there is still not enough memory.

```py
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer_4bit,
    text_encoder_2=text_encoder_2_4bit,
    torch_dtype=torch.float16,
    device_map="auto",
)

pipe_kwargs = {
    "prompt": "A cat holding a sign that says hello world",
    "height": 1024,
    "width": 1024,
    "guidance_scale": 3.5,
    "num_inference_steps": 50,
    "max_sequence_length": 512,
}

image = pipe(**pipe_kwargs, generator=torch.manual_seed(0)).images[0]
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/quant-bnb/4bit.png"/>
</div>

When there is enough memory, you can also directly move the pipeline to the GPU with `.to("cuda")` and apply [`~DiffusionPipeline.enable_model_cpu_offload`] to optimize GPU memory usage.

Once a model is quantized, you can push the model to the Hub with the [`~ModelMixin.push_to_hub`] method. The quantization `config.json` file is pushed first, followed by the quantized model weights. You can also save the serialized 4-bit models locally with [`~ModelMixin.save_pretrained`].
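
A sketch of the save-and-reload round trip follows (the local directory name is hypothetical); the quantization settings are serialized alongside the weights, so they should be picked up on reload without passing a `BitsAndBytesConfig` again:

```py
# Serialize the 4-bit transformer locally...
transformer_4bit.save_pretrained("flux.1-dev-transformer-4bit")

# ...and reload it later; the quantization config stored in config.json is applied automatically.
transformer_4bit = FluxTransformer2DModel.from_pretrained(
    "flux.1-dev-transformer-4bit",
    torch_dtype=torch.float16,
)
```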

</hfoption>
</hfoptions>

@@ -199,17 +302,34 @@ quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dty
NF4 is a 4-bit data type from the [QLoRA](https://hf.co/papers/2305.14314) paper, adapted for weights initialized from a normal distribution. You should use NF4 for training 4-bit base models. This can be configured with the `bnb_4bit_quant_type` parameter in the [`BitsAndBytesConfig`]:

```py
import torch

from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig

from diffusers import FluxTransformer2DModel
from transformers import T5EncoderModel

quant_config = TransformersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)

text_encoder_2_4bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)

transformer_4bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)
```

@@ -220,38 +340,74 @@ For inference, the `bnb_4bit_quant_type` does not have a huge impact on performa
Nested quantization is a technique that can save additional memory at no additional performance cost. This feature performs a second quantization of the already quantized weights to save an additional 0.4 bits/parameter.

```py
import torch

from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig

from diffusers import FluxTransformer2DModel
from transformers import T5EncoderModel

quant_config = TransformersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

text_encoder_2_4bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

transformer_4bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)
```

## Dequantizing `bitsandbytes` models

Once quantized, you can dequantize a model to its original precision, but this might result in a small loss of quality. Make sure you have enough GPU RAM to fit the dequantized model.

```python
import torch

from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig

from diffusers import FluxTransformer2DModel
from transformers import T5EncoderModel

quant_config = TransformersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

text_encoder_2_4bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

transformer_4bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

text_encoder_2_4bit.dequantize()
transformer_4bit.dequantize()
```

## Resources
