Add a ControlNet model & pipeline #2407
@@ -0,0 +1,103 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Text-to-Image Generation with ControlNet Conditioning

## StableDiffusionControlNetPipeline

ControlNet by [@lllyasviel](https://huggingface.co/lllyasviel) is a neural network structure to control diffusion models by adding extra conditions.

There are 8 pre-trained ControlNet models that were trained to condition the original Stable Diffusion model on different inputs,
such as edge detection, scribbles, depth maps, semantic segmentations and more.

Using the pretrained models we can provide control images (for example, a depth map) to control Stable Diffusion text-to-image generation so that it follows the structure of the depth image and fills in the details.

The original codebase/paper can be found here:
- [Code](https://github.com/lllyasviel/ControlNet)
- [Paper](https://arxiv.org/abs/2302.05543)

## Available checkpoints

ControlNet requires a *control image* in addition to the text-to-image *prompt*.

Each pretrained model is trained with a different conditioning method, and each method requires a different kind of conditioning image. For example, Canny edge conditioning requires the control image to be the output of a Canny filter, while depth conditioning requires the control image to be a depth map.
See the overview and image examples below.
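A control image for the Canny checkpoint, for instance, can be produced from any photo with an off-the-shelf edge detector. The snippet below is a minimal sketch and not part of this diff: it assumes `opencv-python`, `numpy` and `Pillow` are installed, `input.png` is a hypothetical local photo, and the two thresholds are illustrative values only.

```python
import cv2
import numpy as np
from PIL import Image

# Load an image as grayscale and run the Canny edge detector on it
image = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)  # "input.png" is a hypothetical local file
edges = cv2.Canny(image, 100, 200)  # low/high thresholds are illustrative, not prescribed values

# ControlNet expects white edges on a black background; stack to 3 channels for an RGB control image
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))
control_image.save("control_canny.png")
```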
All checkpoints are converted from [lllyasviel/ControlNet](https://huggingface.co/lllyasviel/ControlNet).

### ControlNet + Stable Diffusion 1.5

| Model Name | Control Image Overview | Control Image Example | Generated Image Example |
|---|---|---|---|
|[takuma104/control_sd15_canny](https://huggingface.co/takuma104/control_sd15_canny)<br/> *Trained with Canny edge detection* | A monochrome image with white edges on a black background.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_bird_canny.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_bird_canny.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_bird_canny_1.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_bird_canny_1.png"/></a>|
|[takuma104/control_sd15_depth](https://huggingface.co/takuma104/control_sd15_depth)<br/> *Trained with Midas depth estimation* |A grayscale image with black representing deep areas and white representing shallow areas.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_vermeer_depth.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_vermeer_depth.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_vermeer_depth_2.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_vermeer_depth_2.png"/></a>|
|[takuma104/control_sd15_hed](https://huggingface.co/takuma104/control_sd15_hed)<br/> *Trained with HED edge detection (soft edge)* |A monochrome image with white soft edges on a black background.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_bird_hed.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_bird_hed.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_bird_hed_1.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_bird_hed_1.png"/></a>|
|[takuma104/control_sd15_mlsd](https://huggingface.co/takuma104/control_sd15_mlsd)<br/> *Trained with M-LSD line detection* |A monochrome image composed only of white straight lines on a black background.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_room_mlsd.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_room_mlsd.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_room_mlsd_0.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_room_mlsd_0.png"/></a>|
|[takuma104/control_sd15_normal](https://huggingface.co/takuma104/control_sd15_normal)<br/> *Trained with normal map* |A [normal mapped](https://en.wikipedia.org/wiki/Normal_mapping) image.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_human_normal.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_human_normal.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_human_normal_1.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_human_normal_1.png"/></a>|
|[takuma104/control_sd15_openpose](https://huggingface.co/takuma104/control_sd15_openpose)<br/> *Trained with OpenPose bone image* |An [OpenPose bone](https://github.com/CMU-Perceptual-Computing-Lab/openpose) image.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_human_openpose.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_human_openpose.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_human_openpose_0.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_human_openpose_0.png"/></a>|
|[takuma104/control_sd15_scribble](https://huggingface.co/takuma104/control_sd15_scribble)<br/> *Trained with human scribbles* |A hand-drawn monochrome image with white outlines on a black background.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_vermeer_scribble.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_vermeer_scribble.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_vermeer_scribble_0.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_vermeer_scribble_0.png"/></a>|
|[takuma104/control_sd15_seg](https://huggingface.co/takuma104/control_sd15_seg)<br/>*Trained with semantic segmentation* |An image that follows the [ADE20K](https://groups.csail.mit.edu/vision/datasets/ADE20K/) segmentation protocol.|<a href="https://huggingface.co/takuma104/controlnet_dev/blob/main/gen_compare/control_images/converted/control_room_seg.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_room_seg.png"/></a>|<a href="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_room_seg_1.png"><img width="64" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/output_images/diffusers/output_room_seg_1.png"/></a>|

We should transfer these checkpoints to an appropriate location. Cc: @patrickvonplaten
## Resources

- [Colab Notebook Example](https://colab.research.google.com/drive/1AiR7Q-sBqO88NCyswpfiuwXZc7DfMyKA?usp=sharing)
- [controlnet_hinter](https://github.com/takuma104/controlnet_hinter): an image preprocessing library for ControlNet

We usually don't add a separate section for resources like this.

I have added a section for "Available Pipelines". I have left my Colab example, but it is okay to delete it as soon as a Space for ControlNet becomes available.
## Usage example

- Basic Example (Canny Edge)

The conditioning image is an outline of the image edges, as detected by a Canny filter. This is the example we'll use to control the generation:

![](https://huggingface.co/takuma104/controlnet_dev/resolve/main/vermeer_canny_edged.png)
```python
from diffusers import StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Canny edged image for control
canny_edged_image = load_image("https://huggingface.co/takuma104/controlnet_dev/resolve/main/vermeer_canny_edged.png")

pipe = StableDiffusionControlNetPipeline.from_pretrained("takuma104/control_sd15_canny").to("cuda")
image = pipe(prompt="best quality, extremely detailed", image=canny_edged_image).images[0]
image.save("generated.png")
```

Note that the text prompt does not make any reference to the structure or contents of the image we are generating. Stable Diffusion interprets the control image as an additional input that controls what to generate.
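One way to see this is to keep the control image fixed and vary only the random seed: the layout keeps following the edges while the details change. Below is a minimal sketch, not part of this diff, reusing the `pipe` and `canny_edged_image` from the example above; the seed values are arbitrary.

```python
import torch

# Same edges, different seeds: the layout stays fixed while textures and colors vary
for seed in (0, 1, 2):
    generator = torch.Generator(device="cuda").manual_seed(seed)
    image = pipe(
        prompt="best quality, extremely detailed",
        image=canny_edged_image,
        generator=generator,
    ).images[0]
    image.save(f"generated_seed{seed}.png")
```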
- Controlling custom Stable Diffusion 1.5 models

Maybe you meant it as a heading?

Fixed in 06bb1db
In the following example we use PromptHero's [Openjourney model](https://huggingface.co/prompthero/openjourney), which was fine-tuned from the base Stable Diffusion v1.5 model on images from Midjourney. This model has the same structure as Stable Diffusion 1.5 but is capable of producing a different output style.
```py
from diffusers import StableDiffusionControlNetPipeline, AutoencoderKL, UNet2DConditionModel
from diffusers.utils import load_image

# Canny edged image for control
canny_edged_image = load_image("https://huggingface.co/takuma104/controlnet_dev/resolve/main/vermeer_canny_edged.png")

base_model_id = "prompthero/openjourney"  # an example: openjourney model
vae = AutoencoderKL.from_pretrained(base_model_id, subfolder="vae").to("cuda")
unet = UNet2DConditionModel.from_pretrained(base_model_id, subfolder="unet").to("cuda")

pipe = StableDiffusionControlNetPipeline.from_pretrained("takuma104/control_sd15_canny", unet=unet, vae=vae).to("cuda")
image = pipe(prompt="best quality, extremely detailed", image=canny_edged_image, width=512, height=512).images[0]
image.save("generated.png")
```
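If GPU memory is tight, the pipeline's memory-saving helpers can be enabled before running it. This is a minimal sketch, not part of this diff, assuming the pipeline and control image from the example above; `enable_xformers_memory_efficient_attention` additionally requires the `xformers` package to be installed.

```python
# Trade a bit of speed for lower peak memory during attention and VAE decoding
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()

# Optional: memory-efficient attention kernels, if xformers is installed
pipe.enable_xformers_memory_efficient_attention()

image = pipe(prompt="best quality, extremely detailed", image=canny_edged_image).images[0]
```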
[[autodoc]] StableDiffusionControlNetPipeline
- all
- __call__
- enable_attention_slicing
- disable_attention_slicing
- enable_vae_slicing
- disable_vae_slicing
- enable_xformers_memory_efficient_attention
- disable_xformers_memory_efficient_attention

We can discard these.

Fixed in 06bb1db
@@ -27,14 +27,15 @@ Depending on the use case, one should choose a technique accordingly. In many ca
Unless otherwise mentioned, these are techniques that work with existing models and don't require their own weights.

1. [Instruct Pix2Pix](#instruct-pix2pix)
2. [Pix2Pix Zero](#pix2pixzero)
3. [Attend and Excite](#attend-and-excite)
4. [Semantic Guidance](#semantic-guidance)
5. [Self-attention Guidance](#self-attention-guidance)
6. [Depth2Image](#depth2image)
7. [MultiDiffusion Panorama](#multidiffusion-panorama)
8. [DreamBooth](#dreambooth)
9. [Textual Inversion](#textual-inversion)
2. [Pix2Pix 0](#pix2pixzero)
3. [Attend and excite](#attend-and-excite)
4. [Semantic guidance](#semantic-guidance)
5. [Self attention guidance](#self-attention-guidance)
6. [Depth2image](#depth2image)
7. [DreamBooth](#dreambooth)
8. [Textual Inversion](#textual-inversion)
10. [MultiDiffusion Panorama](#panorama)

I think we can keep them untouched.

I think so. Shall I set all of them to "1."?

Yeah basically maybe let's just revert to how they were originally?

Fixed in 3981459

11. [ControlNet](#controlnet)
## Instruct Pix2Pix

@@ -146,3 +147,25 @@ See [here](../training/dreambooth) for more information on how to use it.
[Textual Inversion](../training/text_inversion) fine-tunes a model to teach it about a new concept. For example, a few pictures of a style of artwork can be used to generate images in that style.

See [here](../training/text_inversion) for more information on how to use it.
## MultiDiffusion Panorama

[Paper](https://multidiffusion.github.io/)
[Demo](https://huggingface.co/spaces/weizmannscience/MultiDiffusion)
MultiDiffusion defines a new generation process over a pre-trained diffusion model. This process binds together multiple diffusion generation processes and can be readily applied to generate high-quality and diverse images that adhere to user-provided controls, such as a desired aspect ratio (e.g., a panorama) and spatial guiding signals ranging from tight segmentation masks to bounding boxes.
[MultiDiffusion Panorama](../api/pipelines/stable_diffusion/panorama) makes it possible to generate high-quality images at arbitrary aspect ratios (e.g., panoramas).

See [here](../api/pipelines/stable_diffusion/panorama) for more information on how to use it to generate panoramic images.
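As a rough illustration, not part of this diff, panorama generation with the dedicated pipeline might look like the sketch below; the `stabilityai/stable-diffusion-2-base` checkpoint, the DDIM scheduler, and the 512x2048 canvas are assumptions made for the example.

```python
import torch
from diffusers import StableDiffusionPanoramaPipeline, DDIMScheduler

model_id = "stabilityai/stable-diffusion-2-base"  # assumed base checkpoint
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
pipe = StableDiffusionPanoramaPipeline.from_pretrained(
    model_id, scheduler=scheduler, torch_dtype=torch.float16
).to("cuda")

# A wide canvas: MultiDiffusion fuses overlapping diffusion windows into one panorama
image = pipe("a photo of the dolomites", height=512, width=2048).images[0]
image.save("panorama.png")
```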
If you rebase with the

This is where the manual merge was done when conflicts were encountered. I will fix it.

Fixed in 3981459
## ControlNet

[Paper](https://arxiv.org/abs/2302.05543)
[ControlNet](../api/pipelines/stable_diffusion/controlnet) is a neural network structure to control diffusion models by adding extra conditions.
There are 8 pre-trained ControlNet models that were trained to condition the original Stable Diffusion model on different inputs,
such as edge detection, scribbles, depth maps, semantic segmentations and more.

Using the pretrained models we can provide control images (for example, a depth map) to control Stable Diffusion text-to-image generation so that it follows the structure of the depth image and fills in the details.

See [here](../api/pipelines/stable_diffusion/controlnet) for more information on how to use it.
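For a concrete sense of depth-map control, the sketch below mirrors the Canny example from the ControlNet page above, swapping in the depth-conditioned checkpoint and one of the depth control images linked in the checkpoint table; the usage pattern is assumed for illustration, not taken from this diff.

```python
from diffusers import StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Depth map used as the control image (black = deep/far, white = shallow/near)
depth_image = load_image(
    "https://huggingface.co/takuma104/controlnet_dev/resolve/main/gen_compare/control_images/converted/control_vermeer_depth.png"
)

pipe = StableDiffusionControlNetPipeline.from_pretrained("takuma104/control_sd15_depth").to("cuda")
image = pipe(prompt="best quality, extremely detailed", image=depth_image).images[0]
image.save("generated_depth.png")
```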
Please refer to this documentation as a reference to see how we usually structure the pipeline documentation.
In particular, "Overview" should be the first sub-section. Under that, we include the paper name, the abstract, and other resources.

After the paper name, however, feel free to add a one-liner to denote what ControlNet does briefly.

Thanks for your feedback. I'm rewriting it, taking the Pix2Pix documentation as a reference.
Should the title of this document also be aligned with the title of the paper? I think the current title (Text-to-Image Generation with ControlNet Conditioning) is also clear.
The sentence following the paper name - I'm thinking of using the sentence @pcuenca wrote. Would it be better to make it a bit shorter?

Okay to me.
This one sounds fine!

Thanks! Already applied in this update. 06bb1db