stable-diffusion: TAESD implementation - faster autoencoder #88


Merged
merged 25 commits into leejet:master from FSSRepo:taesd-impl on Dec 5, 2023
Commits (25)
802bfaf
dummy implementation of TAESD
FSSRepo Nov 26, 2023
6604eab
taesd decoder gpu offloading
FSSRepo Nov 26, 2023
190a59f
fix assert backend == GGML_BACKEND_GPU debug
FSSRepo Nov 27, 2023
6bf653c
enable taesd encoder
FSSRepo Nov 27, 2023
211c274
add taesd model
FSSRepo Nov 27, 2023
c652774
apply scale to input data
FSSRepo Nov 27, 2023
8f15c3a
fix latent prop
FSSRepo Nov 27, 2023
9710dd4
TAESD Encoder fixed - working correctly
FSSRepo Nov 27, 2023
84ad9fb
fix some convert bugs
FSSRepo Nov 27, 2023
15add38
Merge branch 'leejet:master' into taesd-impl
FSSRepo Nov 27, 2023
254fab7
show seed when generating image with -s -1
FSSRepo Nov 27, 2023
82a0549
Merge branch 'taesd-impl' of https://github.com/FSSRepo/stable-diffus…
FSSRepo Nov 27, 2023
3dcca60
update docs
FSSRepo Nov 28, 2023
613d65e
less restrictive with larger images
FSSRepo Nov 28, 2023
c957680
cuda: im2col speedup x2
FSSRepo Nov 28, 2023
585f00e
Merge branch 'leejet:master' into taesd-impl
FSSRepo Nov 28, 2023
99e9ec1
cuda: group norm speedup x90
FSSRepo Nov 28, 2023
2d760b7
Merge branch 'taesd-impl' of https://github.com/FSSRepo/stable-diffus…
FSSRepo Nov 28, 2023
4aa00e5
quantized models now works in cuda :)
FSSRepo Nov 29, 2023
864ab9f
fix ignore models folder
FSSRepo Nov 29, 2023
48897d6
Merge branch 'master' into taesd-impl
leejet Dec 3, 2023
f8c5776
remove merge oopsy
leejet Dec 3, 2023
7e4197f
Merge branch 'master' into taesd-impl
leejet Dec 3, 2023
40f1c4a
fix group norm cuda + some fixes taesd + rm vae paddings
FSSRepo Dec 4, 2023
689bc26
fix cal mem size
leejet Dec 4, 2023
7 changes: 4 additions & 3 deletions .gitignore
@@ -8,6 +8,7 @@ test/
 *.bin
 *.exe
 *.gguf
+*.log
-output.png
-models/
+output*.png
+models*
+!taesd-model.gguf
-*.log
35 changes: 28 additions & 7 deletions README.md
@@ -9,22 +9,23 @@ Inference of [Stable Diffusion](https://github.com/CompVis/stable-diffusion) in
## Features

- Plain C/C++ implementation based on [ggml](https://github.com/ggerganov/ggml), working in the same way as [llama.cpp](https://github.com/ggerganov/llama.cpp)
-- Super lightweight and without external dependencies.
+- Super lightweight and without external dependencies
- SD1.x and SD2.x support
- 16-bit, 32-bit float support
- 4-bit, 5-bit and 8-bit integer quantization support
- Accelerated memory-efficient CPU inference
- Only requires ~2.3GB when using txt2img with fp16 precision to generate a 512x512 image, enabling Flash Attention just requires ~1.8GB.
- AVX, AVX2 and AVX512 support for x86 architectures
-- Full CUDA backend for GPU acceleration, for now just for float16 and float32 models. There are some issues with quantized models and CUDA; it will be fixed in the future.
-- Can load ckpt, safetensors and diffusers models/checkpoints. Standalone VAEs models.
+- Full CUDA backend for GPU acceleration.
+- Can load ckpt, safetensors and diffusers models/checkpoints. Standalone VAEs models
- No need to convert to `.ggml` or `.gguf` anymore!
-- Flash Attention for memory usage optimization (only cpu for now).
+- Flash Attention for memory usage optimization (only cpu for now)
- Original `txt2img` and `img2img` mode
- Negative prompt
- [stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui) style tokenizer (not all the features, only token weighting for now)
- LoRA support, same as [stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features#lora)
- Latent Consistency Models support (LCM/LCM-LoRA)
+- Faster and memory efficient latent decoding with [TAESD](https://github.com/madebyollin/taesd)
- Sampling method
- `Euler A`
- `Euler`
@@ -47,9 +48,10 @@
- [ ] More sampling methods
- [ ] Make inference faster
- The current implementation of ggml_conv_2d is slow and has high memory usage
- Implement Winograd Convolution 2D for 3x3 kernel filtering
- [ ] Continuing to reduce memory usage (quantizing the weights of ggml_conv_2d)
- [ ] Implement BPE Tokenizer
- [ ] Add [TAESD](https://github.com/madebyollin/taesd) for faster VAE decoding
- [ ] Implement [Real-ESRGAN](https://github.com/xinntao/Real-ESRGAN/tree/master) upscaler
- [ ] k-quants support

## Usage
@@ -122,7 +124,7 @@ cmake --build . --config Release
### Run

```
-usage: ./bin/sd [arguments]
+usage: sd [arguments]

arguments:
-h, --help show this help message and exit
@@ -131,8 +133,10 @@ arguments:
If threads <= 0, then threads will be set to the number of CPU physical cores
-m, --model [MODEL] path to model
--vae [VAE] path to vae
+--taesd [TAESD_PATH] path to taesd. Using Tiny AutoEncoder for fast decoding (low quality)
--type [TYPE] weight type (f32, f16, q4_0, q4_1, q5_0, q5_1, q8_0)
-If not specified, the default is the type of the weight file. --lora-model-dir [DIR] lora model directory
+If not specified, the default is the type of the weight file.
+--lora-model-dir [DIR] lora model directory
-i, --init-img [IMAGE] path to the input image, required by img2img
-o, --output OUTPUT path to write result image to (default: ./output.png)
-p, --prompt [PROMPT] the prompt to render
@@ -218,6 +222,23 @@ Here's a simple example:
| ---- |---- |
| ![](./assets/without_lcm.png) |![](./assets/with_lcm.png) |

## Using TAESD for faster decoding

You can use TAESD to accelerate the decoding of latent images by following these steps:

- Download the model [weights](https://huggingface.co/madebyollin/taesd/blob/main/diffusion_pytorch_model.safetensors), or fetch them with curl:

```bash
# use /resolve/ (raw file) rather than /blob/ (HTML page) when downloading with curl
curl -L -O https://huggingface.co/madebyollin/taesd/resolve/main/diffusion_pytorch_model.safetensors
```

- Specify the model path using the `--taesd PATH` parameter. For example:

```bash
sd -m ../models/v1-5-pruned-emaonly.safetensors -p "a lovely cat" --taesd ../models/diffusion_pytorch_model.safetensors
```
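Since this PR also made quantized models work on CUDA (commit 4aa00e5), TAESD decoding can be combined with a quantized weight type. A hypothetical invocation, assuming the same model files as above (the paths are examples, not part of this PR):

```shell
# Sketch: txt2img with 8-bit quantized weights plus TAESD latent decoding.
# Adjust the model paths to wherever your files actually live.
sd -m ../models/v1-5-pruned-emaonly.safetensors \
   --type q8_0 \
   --taesd ../models/diffusion_pytorch_model.safetensors \
   -p "a lovely cat" \
   -o ./output.png
```

TAESD trades a small amount of decode quality for speed and memory, so it pairs well with quantization when both VRAM and latency matter.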

### Docker
