Commit c943406

update docs
1 parent cc7efa2 commit c943406


README.md

Lines changed: 16 additions & 4 deletions
````diff
@@ -24,7 +24,7 @@ Inference of Stable Diffusion and Flux in pure C/C++
 - Full CUDA, Metal, Vulkan and SYCL backend for GPU acceleration.
 - Can load ckpt, safetensors and diffusers models/checkpoints. Standalone VAEs models
     - No need to convert to `.ggml` or `.gguf` anymore!
-- Flash Attention for memory usage optimization (only cpu for now)
+- Flash Attention for memory usage optimization
 - Original `txt2img` and `img2img` mode
 - Negative prompt
 - [stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui) style tokenizer (not all the features, only token weighting for now)
````
````diff
@@ -182,11 +182,20 @@ Example of text2img by using SYCL backend:

 ##### Using Flash Attention

-Enabling flash attention reduces memory usage by at least 400 MB. At the moment, it is not supported when CUBLAS is enabled because the kernel implementation is missing.
+Enabling flash attention for the diffusion model reduces memory usage by a model-dependent amount, e.g.:
+
+- flux 768x768: ~600 MB
+- SD2 768x768: ~1400 MB
+
+For most backends it slows things down, but for CUDA it generally speeds things up too.
+At the moment, it is only supported for some models and some backends (such as cpu, cuda/rocm, and metal).
+
+Enable it by adding `--diffusion-fa` to the arguments and watch for:
 ```
-cmake .. -DSD_FLASH_ATTN=ON
-cmake --build . --config Release
+[INFO ] stable-diffusion.cpp:312 - Using flash attention in the diffusion model
+```
+and for the compute buffer shrinking in the debug log:
+```
+[DEBUG] ggml_extend.hpp:1004 - flux compute buffer size: 650.00 MB(VRAM)
 ```

 ### Run
````
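The new flag slots into an ordinary generation command. As a sketch (the binary name `sd`, the `-m`/`-p` flags, and `-v` for verbose logging follow the project's usual CLI; the model path and prompt are placeholders):

```shell
# Hypothetical invocation: enable flash attention in the diffusion model.
# The model file and prompt below are placeholders, not files from this repo.
./sd -m ../models/sd_xl_base_1.0.safetensors \
     -p "a lovely cat" \
     --diffusion-fa -v
# If the backend supports it, the log should contain the INFO line quoted
# above, and the verbose/debug output shows the smaller compute buffer.
```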
````diff
@@ -239,6 +248,9 @@ arguments:
   --vae-tiling                       process vae in tiles to reduce memory usage
   --vae-on-cpu                       keep vae in cpu (for low vram)
   --clip-on-cpu                      keep clip in cpu (for low vram).
+  --diffusion-fa                     use flash attention in the diffusion model (for low vram).
+                                     Might lower quality, since it implies converting k and v to f16.
+                                     This might crash if it is not supported by the backend.
   --control-net-cpu                  keep controlnet in cpu (for low vram)
   --canny                            apply canny preprocessor (edge detection)
   --color                            Colors the logging tags according to level
````
