Ultimate ControlNet Handbook

ControlNet is a neural network that controls image generation in Stable Diffusion by adding extra conditions. Details can be found in the article Adding Conditional Control to Text-to-Image Diffusion Models by Lvmin Zhang and coworkers.

It is a game changer. To name a few uses, you can use ControlNet to:

  • Specify human poses.
  • Copy the composition from another image.
  • Generate a similar image.
  • Turn a scribble into a professional image.

In this post, you will learn everything you need to know about ControlNet:

  • What ControlNet is and how it works.
  • How to install ControlNet on Windows, Mac, and Google Colab.
  • How to use ControlNet.
  • All ControlNet models explained.
  • Some usage examples.

This guide is for ControlNet with Stable Diffusion v1.5 models. See the guide for ControlNet with SDXL models.

What is ControlNet?

ControlNet is a neural network model for controlling Stable Diffusion models. You can use ControlNet along with any Stable Diffusion model.

The most basic form of using Stable Diffusion models is text-to-image. It uses text prompts as the conditioning to steer image generation so that you generate images that match the text prompt.

ControlNet adds one more conditioning in addition to the text prompt. The extra conditioning can take many forms in ControlNet.

Let me show you two examples of what ControlNet can do: controlling image generation with (1) edge detection and (2) human pose detection.

Edge detection example

As illustrated below, ControlNet takes an additional input image and detects its outlines using the Canny edge detector. An image containing the detected edges is then saved as a control map. It is fed into the ControlNet model as an extra conditioning to the text prompt.

Stable Diffusion ControlNet with Canny edge conditioning.

The process of extracting specific information (edges in this case) from the input image is called annotation (in the research article) or preprocessing (in the ControlNet extension).
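To make the idea of a control map concrete, here is a minimal pure-Python sketch that turns a tiny grayscale image into a black-and-white edge map. It is a simplified stand-in (gradient magnitude plus a single threshold), not the Canny algorithm itself, which also smooths the image, thins the edges, and applies two thresholds with hysteresis. The function name edge_map and the threshold value are made up for illustration.

```python
def edge_map(image, threshold=100):
    """Return a binary edge map (0 or 255) for a 2D list of grayscale values.

    Simplified stand-in for an edge detector: central-difference gradient
    magnitude followed by a single threshold. Border pixels are left at 0.
    """
    height, width = len(image), len(image[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = image[y][x + 1] - image[y][x - 1]  # horizontal change
            gy = image[y + 1][x] - image[y - 1][x]  # vertical change
            magnitude = (gx * gx + gy * gy) ** 0.5
            if magnitude >= threshold:
                edges[y][x] = 255
    return edges


# A 5x6 image: dark on the left half, bright on the right half.
image = [[0, 0, 0, 255, 255, 255] for _ in range(5)]
control_map = edge_map(image)
for row in control_map:
    print(row)
```

The white pixels line up along the boundary between the dark and bright halves. That image of edges, not the original picture, is what gets passed on as the extra conditioning.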

Human pose detection example

Edge detection is not the only way an image can be preprocessed. OpenPose is a fast human keypoint detection model that can extract human poses like positions of hands, legs, and head. See the example below.

Input image annotated with human pose detection using OpenPose.

Below is the ControlNet workflow using OpenPose. Keypoints are extracted from the input image using OpenPose, and saved as a control map containing the positions of key points. It is then fed to Stable Diffusion as an extra conditioning together with the text prompt. Images are generated based on these two conditionings.

What’s the difference between using Canny edge detection and OpenPose? The Canny edge detector extracts the edges of the subject and background alike. It tends to translate the scene more faithfully. You can see the dancing man became a woman, but the outline and hairstyle are preserved.

OpenPose only detects human key points such as positions of the head, arms, etc. The image generation is more liberal but follows the original pose.

The above example generated a woman jumping up with the left foot pointing sideways, different from the original image and the one in the Canny Edge example. The reason is that OpenPose’s keypoint detection does not specify the orientations of the feet.

Installing Stable Diffusion ControlNet

(The instructions are updated for ControlNet v1.1)

Let’s walk through how to install ControlNet in AUTOMATIC1111, a popular and full-featured (and free!) Stable Diffusion GUI. We will use this extension, which is the de facto standard, for using ControlNet.

If you already have ControlNet installed, you can skip to the next section to learn how to use it.

Install ControlNet in Google Colab

It’s easy to use ControlNet with the 1-click Stable Diffusion Colab notebook in our Quick Start Guide.

In the Extensions section of the Colab notebook, check ControlNet.

Press the Play button to start AUTOMATIC1111. That’s it!

Install ControlNet on Windows PC or Mac

You can use ControlNet with AUTOMATIC1111 on Windows PC or Mac. Follow the instructions in these articles to install AUTOMATIC1111 if you have not already done so.

If you already have AUTOMATIC1111 installed, make sure your copy is up-to-date.

Install ControlNet extension (Windows/Mac)

1. Navigate to the Extensions page.

2. Select the Install from URL tab.

3. Put the following URL in the URL for extension’s repository field.

https://github.com/Mikubill/sd-webui-controlnet

4. Click the Install button.

5. Wait for the confirmation message saying the extension is installed.

6. Restart AUTOMATIC1111.

7. Visit the ControlNet models page.

8. Download all model files (filename ending with .pth).

(If you don’t want to download all of them, you can download the openpose and canny models for now, which are most commonly used.)

9. Put the model file(s) in the ControlNet extension’s models directory.

stable-diffusion-webui/extensions/sd-webui-controlnet/models

Restart AUTOMATIC1111 webui.

If the extension is successfully installed, you will see a new collapsible section in the txt2img tab called ControlNet. It should be right above the Script drop-down menu.

This indicates the extension installation was successful.
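If the ControlNet section shows up but the Model dropdown is empty, a quick check is to list the model files in the extension's models directory. Below is a minimal sketch. The function name list_controlnet_models is hypothetical, and it assumes the default folder layout described in this guide.

```python
from pathlib import Path


def list_controlnet_models(webui_root):
    """Return the sorted names of .pth files in the extension's models folder."""
    models_dir = Path(webui_root) / "extensions" / "sd-webui-controlnet" / "models"
    if not models_dir.is_dir():
        return []
    return sorted(p.name for p in models_dir.glob("*.pth"))


if __name__ == "__main__":
    # Assumption: the Web-UI lives in ./stable-diffusion-webui; adjust as needed.
    for name in list_controlnet_models("stable-diffusion-webui"):
        print(name)
```

An empty result means either the path is different on your machine or the model files ended up somewhere else.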

Installing T2I adapters

T2I adapters are neural network models for providing extra controls to image generations of diffusion models. They are conceptually similar to ControlNet but with a different design.

T2I adapters model (source).

The A1111 ControlNet extension can use T2I adapters. You will need to download the models here. Grab the ones with file names that read like t2iadapter_XXXXX.pth.

The functions of many T2I adapters overlap with those of ControlNet models. I will only cover two of them in this guide: the color grid adapter and the CLIP vision style adapter.

Put them in ControlNet’s model folder.

stable-diffusion-webui/extensions/sd-webui-controlnet/models

Updating the ControlNet extension

ControlNet is an extension that has undergone rapid development. It is not uncommon to find out your copy of ControlNet is outdated.

Updating is needed only if you run AUTOMATIC1111 locally on Windows or Mac. The Colab notebook of the site always runs the latest version of the ControlNet extension.

To determine if your ControlNet version is up-to-date, compare your version number in the ControlNet section on the txt2img page with the latest version number.

Option 1: Update from Web-UI

The easiest way to update the ControlNet extension is using the AUTOMATIC1111 GUI.

  1. Go to the Extensions page.
  2. In the Installed tab, click Check for updates.
  3. Wait for the confirmation message.
  4. Completely close and restart AUTOMATIC1111 Web-UI.

Option 2: Command line

If you are comfortable with the command line, you can use this option to update ControlNet, which gives you peace of mind that the Web-UI is not doing something else.

Step 1: Open the Terminal App (Mac) or the PowerShell App (Windows).

Step 2: Navigate to ControlNet extension’s folder. (Adjust accordingly if you installed somewhere else)

cd stable-diffusion-webui/extensions/sd-webui-controlnet

Step 3: Update the extension by running the following command.

git pull

Using ControlNet – a simple example

Now you have ControlNet installed, let’s go through a simple example of using it! You will see a detailed explanation of each setting later.

You should have the ControlNet extension installed to follow this section. You can verify by seeing the ControlNet section below.

Press the caret on the right to expand the ControlNet panel. It shows the full section of control knobs and an image upload canvas.

I will use the following image to show you how to use ControlNet. You can download the image using the download button to follow the tutorial.

Text-to-image settings

ControlNet will need to be used with a Stable Diffusion model. In the Stable Diffusion checkpoint dropdown menu, select the model you want to use with ControlNet. Select v1-5-pruned-emaonly.ckpt to use the v1.5 base model.

In the txt2img tab, write a prompt and, optionally, a negative prompt to be used by ControlNet. I will use the prompts below.

Prompt:

full-body, a young female, highlights in hair, dancing outside a restaurant, brown eyes, wearing jeans

Negative prompt:

disfigured, ugly, bad, immature

Set image size for image generation. I will use width 512 and height 776 for my demo image. Note that the image size is set in the txt2img section, NOT in the ControlNet section.

The GUI should look like the one below.

ControlNet settings

Now let’s move on to the ControlNet panel.

First, upload an image to the image canvas.

Check the Enable checkbox.

You will need to select a preprocessor and a model. Preprocessor is just a different name for the annotator mentioned earlier, such as the OpenPose keypoint detector. Let’s select openpose as Preprocessor.

The selected ControlNet model has to be consistent with the preprocessor. For OpenPose, you should select control_openpose-fp16 as the model.

The ControlNet panel should look like this.

That’s all. Now press Generate to start generating images using ControlNet.

You should see that the generated images follow the pose of the input image. The last image comes straight from the preprocessing step. In this case, it shows the detected keypoints.

When you are done, uncheck the Enable checkbox to disable the ControlNet extension.

These are the basics of using ControlNet!
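The same settings can also be written down as a request body for the Web-UI's API, which is handy for batch jobs. The sketch below only builds the body and does not send anything. It assumes AUTOMATIC1111 was started with the --api flag and that the ControlNet extension reads its units from alwayson_scripts, controlnet, args. The field names reflect my understanding of the extension's API and may differ between versions, so check them against the API page of your own install before relying on them. The helper name build_txt2img_payload is made up for this example.

```python
import base64
import json


def build_txt2img_payload(image_bytes, prompt, negative_prompt=""):
    """Build a txt2img request body with one ControlNet unit (OpenPose).

    Assumption: the unit field names below match your extension version.
    """
    controlnet_unit = {
        "enabled": True,
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "module": "openpose",              # the preprocessor
        "model": "control_openpose-fp16",  # must match a name in your Model dropdown
        "weight": 1.0,
    }
    return {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "width": 512,
        "height": 776,
        "alwayson_scripts": {"controlnet": {"args": [controlnet_unit]}},
    }


payload = build_txt2img_payload(
    b"fake image bytes for illustration",
    "full-body, a young female, highlights in hair, dancing outside a restaurant, "
    "brown eyes, wearing jeans",
    "disfigured, ugly, bad, immature",
)
print(json.dumps(payload, indent=2)[:300])
```

In a real script, you would read the bytes from your input image file and POST the body as JSON to the txt2img endpoint of your running Web-UI (/sdapi/v1/txt2img).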

The rest is to understand:

  • What preprocessors are available (There are many!)
  • ControlNet settings

Preprocessors and models

The first step of using ControlNet is to choose a preprocessor. It is helpful to turn on the preview so that you know what the preprocessor is doing. Once the preprocessing is done, the original image is discarded, and only the preprocessed image will be used for ControlNet.

To turn on the preview:

  1. Select Allow Preview.
  2. Optionally select Pixel Perfect. ControlNet will use the image height and width you specified in text-to-image to generate the preprocessed image.
  3. Click on the explosion icon next to the Preprocessor dropdown menu.

Some Control models may affect the image too much. Reduce the Control Weight if you see color issues or other artifacts.

Choosing the right model

Once you choose a preprocessor, you must pick the correct model.

It is easy to tell which is the correct model to use in v1.1. All you need to do is to select the model with the same starting keyword as the preprocessor.

For example:

Preprocessor      Model
depth_xxxx        control_xxxx_depth
lineart_xxxx      control_xxxx_lineart
openpose_xxxx     control_xxxx_openpose
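The keyword rule is simple enough to write as a few lines of Python. This is only a sketch of the naming convention, not something the extension exposes: matching_models is a hypothetical helper, and the file names in the list are examples that may not match what you downloaded.

```python
def matching_models(preprocessor, model_names):
    """Return the models whose name contains the preprocessor's leading keyword."""
    keyword = preprocessor.split("_")[0].lower()
    return [name for name in model_names if keyword in name.lower()]


# Example file names; yours may differ.
installed = [
    "control_v11f1p_sd15_depth",
    "control_v11p_sd15_lineart",
    "control_v11p_sd15_openpose",
]
print(matching_models("depth_midas", installed))
print(matching_models("openpose_full", installed))
```

The rule only covers preprocessors whose name starts with the keyword. A name like dw_openpose_full starts with a different word, so this helper returns nothing for it and you have to pick the openpose model by hand.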

OpenPose

There are multiple OpenPose preprocessors.

OpenPose detects human key points such as positions of the head, shoulders, hands, etc. It is useful for copying human poses without copying other details like outfits, hairstyles, and backgrounds.

All openpose preprocessors need to be used with the openpose model in ControlNet’s Model dropdown menu.

The OpenPose preprocessors are:

  • OpenPose: eyes, nose, neck, shoulders, elbows, wrists, knees, and ankles.
  • OpenPose_face: OpenPose + facial details
  • OpenPose_hand: OpenPose + hands and fingers
  • OpenPose_faceonly: facial details only
  • OpenPose_full: All of the above
  • dw_openpose_full: An enhanced version of OpenPose_full

Pro tip: Use dw_openpose_full to extract all details.

OpenPose

OpenPose is the basic OpenPose preprocessor that detects the positions of the eyes, nose, neck, shoulders, elbows, wrists, knees, and ankles.

OpenPose_face

OpenPose_face does everything the OpenPose preprocessor does and additionally detects facial details.

It is useful for copying the facial expression.

Sample images:

OpenPose_faceonly

OpenPose_faceonly detects the face but no other keypoints. It is useful for copying the face alone while leaving the rest of the body free.

See samples from text-to-image below. The body is not constrained.

OpenPose_hand

OpenPose_hand detects the same keypoints as OpenPose, plus the hands and fingers.

Sample images:

OpenPose_full

OpenPose_full detects everything OpenPose_face and OpenPose_hand do.

Sample images:

dw_openpose_full

DWPose is a new pose detection algorithm based on the research article Effective Whole-body Pose Estimation with Two-stages Distillation. It accomplishes the same task as OpenPose Full but does a better job. You should use dw_openpose_full instead of openpose_full.

Update ControlNet if you don’t see dw_openpose_full in the preprocessor menu.

Reference image for OpenPose and DW OpenPose.
OpenPose Full
DW OpenPose Full

Tile resample

The Tile resample model is used for adding details to an image. It is often used with an upscaler to enlarge an image at the same time.

See the ControlNet Tile Upscaling method.

Reference

Reference is a set of preprocessors that lets you generate images similar to the reference image. The Stable Diffusion model and the prompt will still influence the images.

Reference preprocessors do NOT use a control model. You only need to select the preprocessor but not the model. (In fact, the model dropdown menu will be hidden after selecting a reference preprocessor.)

There are 3 reference preprocessors.

  1. Reference adain: Style transfer via Adaptive Instance Normalization. (paper)
  2. Reference only: Link the reference image directly to the attention layers.
  3. Reference adain+attn: A combination of the above.

Select one of these preprocessors to use.

Below is an example.

Reference image (Input).

Using CLIP interrogator to guess the prompt.

a woman with pink hair and a robot suit on, with a sci – fi, Artgerm, cyberpunk style, cyberpunk art, retrofuturism

disfigured, ugly, bad, immature

Model: Protogen v2.2

Reference adain

Reference only

Reference adain+attn

I would say reference-only works best if you twist my arm.

The above images are all from the Balanced mode. I don’t see a big difference when changing the style fidelity.

Image Prompt adapter (IP-adapter)

An Image Prompt adapter (IP-adapter) is a ControlNet model that allows you to use an image as a prompt. Read the article IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models by Hu Ye and coworkers and visit their GitHub page for implementation details.

Install IP-adapter models

Before using the IP adapters in ControlNet, download the IP-adapter models for the v1.5 model.

Put them in ControlNet’s model folder.

stable-diffusion-webui > extensions > sd-webui-controlnet > models

Using IP-adapter

The IP-adapter allows you to use an image as a prompt, so you will need to supply a reference image. Let’s use the following image.

Reference image for image prompt.

In the ControlNet section, upload the image to the image canvas.

Here are the rest of the ControlNet settings to use IP-adapter:

  • Enable: Yes
  • Pixel Perfect: Yes
  • Control Type: IP-Adapter
  • Preprocessor: ip-adapter_clip_sd15
  • Model: ip-adapter_sd15

Below are images with and without the IP-adapters.

Without IP-adapter.
With IP-adapter SD1.5

See the features of the reference image like the flowers and darker colors transferred to the generated image!

The SD 1.5 Plus IP-Adapter model does something similar but exerts a stronger effect.

Without IP-adapter.
With IP-adapter SD1.5 Plus

The SD1.5 Plus model is very strong. It will almost copy the reference image. You can reduce the Control Weight to tone it down.

Control Weight: 1
Control Weight: 0.5
Control Weight: 0.2

Canny

The Canny edge detector is a general-purpose, old-school edge detector. It extracts the outlines of an image. It is useful for retaining the composition of the original image.

Select canny in both Preprocessor and Model dropdown menus to use.

The generated images will follow the outlines.

Depth

The depth preprocessor guesses the depth information from the reference image.

  • Depth Midas: A classic depth estimator. Also used in the official v2 depth-to-image model.
  • Depth Leres: More details, but also tends to render the background.
  • Depth Leres++: Even more details.
  • Zoe: The level of detail sits between Midas and Leres.
  • Depth Anything: A newer and enhanced depth model.
  • Depth Hand Refiner: For fixing hands in inpainting.

Reference Image:

Depth maps:

Midas
Leres
Leres++
Zoe

Prompt and negative prompt:

a woman retrofuturism

disfigured, ugly, bad, immature

You can see the generated image follows the depth map (Zoe).

Text-to-image with Depth Zoe.

Compare with the more detailed Leres++:

Text-to-image with Depth Leres.

Line Art

Line Art renders the outline of an image. It attempts to convert it to a simple drawing.

There are a few line art preprocessors.

  • Line art anime: Anime-style lines
  • Line art anime denoise: Anime-style lines with fewer details.
  • Line art realistic: Realistic-style lines.
  • Line art coarse: Realistic-style lines with heavier weight.

Use with the lineart control model.

The images below are generated with Control Weight set to 0.7.

Line Art Anime

Line Art Anime Denoise

Line Art Realistic

Line Art Coarse

MLSD

M-LSD (Mobile Line Segment Detection) is a straight-line detector. It is useful for extracting outlines with straight edges like interior designs, buildings, street scenes, picture frames, and paper edges.

Curves will be ignored.

Normal maps

A normal map specifies the orientation of a surface. For ControlNet, it is an image that specifies the orientation of the surface each pixel rests on. Instead of color values, the image pixels represent the direction a surface is facing.

The usage of normal maps is similar to the depth maps. They are used to transfer the 3D composition of the reference image.

Normal map preprocessors:

  • Normal Midas: Estimate the normal map from the Midas depth map.
  • Normal Bae: Estimate the normal map using the normal uncertainty method proposed by Bae et al.
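To give a feel for how a normal map can be derived from a depth map, here is a minimal pure-Python sketch that uses finite differences. It illustrates the general idea only and is not the code the preprocessors run. The sign conventions and the RGB encoding are assumptions, and they vary between tools.

```python
def normals_from_depth(depth):
    """Estimate a per-pixel surface normal from a 2D depth map.

    Uses central differences for interior pixels and one-sided differences at
    the borders. Returns a 2D list of unit vectors (nx, ny, nz).
    """
    height, width = len(depth), len(depth[0])
    normals = []
    for y in range(height):
        row = []
        for x in range(width):
            x0, x1 = max(x - 1, 0), min(x + 1, width - 1)
            y0, y1 = max(y - 1, 0), min(y + 1, height - 1)
            dzdx = (depth[y][x1] - depth[y][x0]) / max(x1 - x0, 1)
            dzdy = (depth[y1][x] - depth[y0][x]) / max(y1 - y0, 1)
            length = (dzdx * dzdx + dzdy * dzdy + 1.0) ** 0.5
            row.append((-dzdx / length, -dzdy / length, 1.0 / length))
        normals.append(row)
    return normals


def normal_to_rgb(normal):
    """Map a unit normal with components in [-1, 1] to 8-bit RGB."""
    return tuple(round((c + 1.0) * 127.5) for c in normal)


flat = [[5.0] * 3 for _ in range(3)]          # a wall facing the camera
ramp = [[float(x) for x in range(4)] for _ in range(3)]  # a surface receding to the right
print(normal_to_rgb(normals_from_depth(flat)[1][1]))
print(normal_to_rgb(normals_from_depth(ramp)[1][2]))
```

A flat surface facing the camera maps to a single bluish color, while a tilted surface shifts the red or green channel. This is why normal maps look like pastel-colored reliefs.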

Normal Midas

Like the Midas depth map, the Midas normal map is good for isolating the subject from the background.

Normal Bae

The Bae normal map tends to render details in both background and foreground.

Scribbles

Scribble preprocessors turn a picture into a scribble, like those drawn by hand.

  • Scribble HED: Holistically-Nested Edge Detection (HED) is an edge detector good at producing outlines like an actual person would. According to ControlNet’s authors, HED is suitable for recoloring and restyling an image.
  • Scribble Pidinet: Pixel Difference network (Pidinet) detects curves and straight edges. Its result is similar to HED but usually results in cleaner lines with fewer details.
  • Scribble xdog: EXtended Difference of Gaussian (XDoG) is an edge detection technique. It is important to adjust the XDoG threshold and observe the preprocessor output.

All these preprocessors should be used with the scribble control model.

Scribble HED

HED produces coarse scribble lines.

Scribble Pidinet

Pidinet tends to produce coarse lines with little detail. It’s good for copying the broad outline without fine details.

Scribble XDoG

The level of detail is controllable by adjusting the XDoG threshold, making XDoG a versatile preprocessor for creating scribbles.

Segmentation

Segmentation preprocessors label what kind of objects are in the reference image.

Below is a segmentation preprocessor in action.

The buildings, sky, trees, people, and sidewalks are labeled with different and predefined colors.

You can find the object categories and colors in the color map here for ufade20k and ofade20k.

There are a few segmentation options:

  • ufade20k: UniFormer (uf) segmentation trained on the ADE20K dataset.
  • ofade20k: OneFormer (of) segmentation trained on the ADE20K dataset.
  • ofcoco: OneFormer segmentation trained on the COCO dataset.

Note that the color maps of ADE20k and COCO segmentations are different.

You can use segmentation preprocessors to transfer the location and shape of objects.

Below are the results of using these preprocessors with the same prompt and seed.

Futuristic city, tree, buildings, cyberpunk

UniFormer ADE20k (ufade20k)

UniFormer labels everything accurately in this example.

OneFormer ADE20k (ofade20k)

OneFormer is a bit noisier in this case, but that doesn’t affect the final image.

OneFormer COCO (ofcoco)

OneFormer COCO performs similarly, with some mislabels.

Segmentation is a powerful technique. You can further manipulate the segmentation map to put objects at precise locations. Use the color map for ADE20k.

Shuffle

The Shuffle preprocessor stirs up the input image. Let’s see the shuffle in action.

Together with the Shuffle control model, the Shuffle preprocessor can be used for transferring the color scheme of the reference image.

Input image:

Shuffle preprocessor:

Unlike other preprocessors, the Shuffle preprocessor is randomized. It will be affected by your seed value.

Use the Shuffle preprocessor with the Shuffle control model. The Shuffle control model can be used with or without the Shuffle preprocessor.

The image below is with ControlNet Shuffle preprocessor and Shuffle model (Same prompt as the last section). The color scheme roughly follows the reference image.

The image below is with the ControlNet Shuffle model only (Preprocessor: None). The image composition is closer to the original. The color scheme is similar to that of the shuffled image.

For your reference, the image below is with the same prompt without ControlNet. The color scheme is drastically different.

Color grid T2I adapter

The color grid T2I adapter preprocessor shrinks the reference image by a factor of 64 and then expands it back to the original size. The net effect is a grid-like patch of local average colors.
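The shrink-then-expand idea can be sketched in a few lines of pure Python. This is an illustration under assumptions, not the preprocessor's actual code: I average each block for the shrink step and copy the average back to every pixel for the expand step, and the function name color_grid is made up.

```python
def color_grid(image, factor=64):
    """Replace each factor x factor block with its average color.

    `image` is a 2D list of (r, g, b) tuples. Averaging each block is the
    shrink step; writing the average back to every pixel of the block is a
    nearest-neighbor expansion back to the original size.
    """
    height, width = len(image), len(image[0])
    out = [[None] * width for _ in range(height)]
    for top in range(0, height, factor):
        for left in range(0, width, factor):
            block = [
                image[y][x]
                for y in range(top, min(top + factor, height))
                for x in range(left, min(left + factor, width))
            ]
            average = tuple(sum(p[c] for p in block) // len(block) for c in range(3))
            for y in range(top, min(top + factor, height)):
                for x in range(left, min(left + factor, width)):
                    out[y][x] = average
    return out


# Tiny demo with factor=2 on a 2x4 image: left half reddish, right half blue.
image = [
    [(200, 0, 0), (100, 0, 0), (0, 0, 255), (0, 0, 255)],
    [(100, 0, 0), (200, 0, 0), (0, 0, 255), (0, 0, 255)],
]
grid = color_grid(image, factor=2)
print(grid[0])
```

Each block ends up as one flat color, so only the rough placement of colors survives. That is all the color adapter needs.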

Original reference image:

Preprocessed with t2ia_color_grid:

The preprocessed image can then be used with the T2I color adapter (t2iadapter_color) control model.

The image generation will loosely follow the color scheme spatially.

A modern living room

Increase the ControlNet weight to have it follow more closely.

You can also use preprocessor None for this T2I color model.

In my opinion, it is pretty similar to image-to-image.

Clip vision style T2I adapter

t2ia_style_clipvision converts the reference image to the CLIP vision embedding. This embedding contains rich information on the image’s content and style.

You will need to use the Control model t2iadapter_style_XXXX.

See this amazing style transfer in action:

Reference image:

T2I adapter – CLIP vision:

sci-fi girl

Below is what this prompt would generate with ControlNet turned off.

The function is pretty similar to Reference ControlNet, but I would rate T2IA CLIP vision higher.

ControlNet Inpainting

ControlNet inpainting lets you use high denoising strength in inpainting to generate large variations without sacrificing consistency with the picture as a whole.

For example, I used the following prompt for realistic people.

Model: HenmixReal v4

photo of young woman, highlight hair, sitting outside restaurant, wearing dress, rim lighting, studio lighting, looking at the camera, dslr, ultra quality, sharp focus, tack sharp, dof, film grain, Fujifilm XT3, crystal clear, 8K UHD, highly detailed glossy eyes, high detailed skin, skin pores

Negative prompt

disfigured, ugly, bad, immature, cartoon, anime, 3d, painting, b&w

I have this image and want to regenerate the face with inpainting.

If I inpaint the face with a high denoising strength (> 0.4), the result is likely to be globally inconsistent. Below are the inpainted images with denoising strength 1.

ControlNet Inpainting is your solution.

To use ControlNet inpainting:

  1. It is best to use the same model that generated the image. After generating an image on the txt2img page, click Send to Inpaint to send the image to the Inpaint tab on the Img2img page.

2. Use the paintbrush tool to create a mask over the area you want to regenerate. See the beginner’s tutorial on inpainting if you are unfamiliar with it.

3. Set Inpaint area to Only masked. (Whole picture also works)

4. Set denoising strength to 1. (You won’t normally set this so high without ControlNet.)

5. Set the following parameters in the ControlNet section. You don’t need to upload a reference image.

Enable: Yes

Preprocessor: Inpaint_global_harmonious

Model: ControlNet

6. Press Generate to start inpainting.

Now I get new faces consistent with the global image, even at the maximum denoising strength (1)!

Currently, there are 3 inpainting preprocessors

  • Inpaint_global_harmonious: Improve global consistency and allow you to use high denoising strength.
  • Inpaint_only: Won’t change unmasked area. It is the same as Inpaint_global_harmonious in AUTOMATIC1111.
  • Inpaint_only+lama: Process the image with the lama model. It tends to produce cleaner results and is good for object removal.
Original image
inpaint+lama

Copying a face with ControlNet

You can use a special IP-adapter face model to generate consistent faces across multiple images.

Installing the IP-adapter plus face model

  1. Make sure your A1111 WebUI and the ControlNet extension are up-to-date.

2. Download the ip-adapter-plus-face_sd15.bin and put it in stable-diffusion-webui > models > ControlNet.

3. Rename the file’s extension from .bin to .pth. (i.e., the file name should be ip-adapter-plus-face_sd15.pth)
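If you prefer to script the rename, a small helper like the one below does the job. It is a sketch: rename_bin_to_pth is a hypothetical name, and the path in the usage note assumes the folder from step 2.

```python
from pathlib import Path


def rename_bin_to_pth(model_path):
    """Rename a .bin model file to .pth in place and return the new path."""
    source = Path(model_path)
    if source.suffix != ".bin":
        raise ValueError(f"expected a .bin file, got {source.name}")
    target = source.with_suffix(".pth")
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    source.rename(target)
    return target
```

For example, calling rename_bin_to_pth("stable-diffusion-webui/models/ControlNet/ip-adapter-plus-face_sd15.bin") leaves you with ip-adapter-plus-face_sd15.pth in the same folder. The helper refuses to overwrite an existing .pth file.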

Using the IP-adapter plus face model

To use the IP adapter face model to copy a face, go to the ControlNet section and upload a headshot image.

Important ControlNet Settings:

  • Enable: Yes
  • Preprocessor: ip-adapter_clip_sd15
  • Model: ip-adapter-plus-face_sd15

The control weight should be around 1. You can use multiple IP-adapter face ControlNets. Make sure to adjust the control weights accordingly so that they sum up to 1.
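When you use several face ControlNets, it can help to decide on the relative importance of each reference first and then rescale so the weights sum to 1. A minimal sketch (normalize_weights is a hypothetical helper, not part of the extension):

```python
def normalize_weights(weights):
    """Scale a list of positive control weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


print(normalize_weights([1, 1]))  # two equally important references
print(normalize_weights([3, 1]))  # the first reference matters three times as much
```

You would then type the resulting numbers into the Control Weight slider of each ControlNet unit.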

With the prompt:

A woman sitting outside of a restaurant in casual dress

Negative prompt:

ugly, deformed, nsfw, disfigured

You get:

Consistent face with multiple IP-adapter face ControlNets.

ALL ControlNet settings explained

You see a lot of settings in the ControlNet extension! It can be a bit intimidating when you first use it, but let’s go through them one by one.

It’s going to be a deep dive. Take a break and go to the bathroom if you need to…

Input controls

Image Canvas: You can drag and drop the input image here. You can also click on the canvas and select a file using the file browser. The input image will be processed by the selected preprocessor in the Preprocessor dropdown menu. A control map will be created.

Write icon: Create a new canvas with a white image instead of uploading a reference image. It is for creating a scribble directly.

Camera icon: Take a picture using your device’s camera and use it as the input image. You will need to grant permission to your browser to access the camera.

Model selection

Enable: Whether to enable ControlNet.

Low VRAM: For GPUs with less than 8GB of VRAM. It is an experimental feature. Check this if you run out of GPU memory or want to increase the number of images processed.

Allow Preview: Check this to enable a preview window next to the reference image. I recommend selecting this option. Use the explosion icon next to the Preprocessor dropdown menu to preview the effect of the preprocessor.

Preprocessor: The preprocessor (called annotator in the research article) for preprocessing the input image, such as detecting edges, depth, and normal maps. None uses the input image as the control map.

Model: ControlNet model to use. If you have selected a preprocessor, you would normally select the corresponding model. The ControlNet model is used together with the Stable Diffusion model selected at the top of AUTOMATIC1111 GUI.

Control Weight

Below the Preprocessor and Model dropdown menus, you will see three sliders that let you dial in the control effect: Control Weight, Starting Control Step, and Ending Control Step.

I will use the following image to illustrate the effect of control weight. It’s an image of a girl sitting down.

But in the prompt, I will ask to generate a woman standing up.

full body, a young female, highlights in hair, standing outside restaurant, blue eyes, wearing a dress, side light

Weight: How much emphasis to give the control map relative to the prompt. It is similar to keyword weight in the prompt but applies to the control map.

The following images are generated using ControlNet OpenPose preprocessor and with the OpenPose model.

As you can see, the ControlNet weight controls how closely the control map is followed relative to the prompt. The lower the weight, the less ControlNet requires the image to follow the control map.

Starting ControlNet step: The step at which ControlNet is first applied. 0 means the very first step.

Ending ControlNet step: The step at which ControlNet stops being applied. 1 means the last step.

Let’s keep the starting step fixed at 0 and change the ending ControlNet step to see what happens.

Since the initial steps set the global composition (the sampler starts with a random tensor in latent space and removes the largest amount of noise in the early steps), the pose is set even if you apply ControlNet to only the first 20% of the sampling steps.

In contrast, changing the ending ControlNet step has a smaller effect because the global composition is set in the beginning steps.
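The starting and ending values are fractions of the sampling run, so it can help to translate them into actual step numbers. The sketch below does that under one assumption of mine: a step counts as active when its position (step divided by total steps) falls at or after the start and before the end. The extension's exact rounding may differ, and the function name is hypothetical.

```python
def controlnet_active_steps(total_steps, start=0.0, end=1.0):
    """Return the 0-based sampling steps during which ControlNet is applied.

    `start` and `end` are fractions of the sampling run. Assumption: a step is
    active when its position step / total_steps lies in [start, end).
    """
    if not 0.0 <= start <= end <= 1.0:
        raise ValueError("expected 0 <= start <= end <= 1")
    return [s for s in range(total_steps) if start <= s / total_steps < end]


print(controlnet_active_steps(20, 0.0, 0.2))  # only the first few steps
print(controlnet_active_steps(20, 0.5, 1.0))  # only the second half
```

With 20 sampling steps, an ending value of 0.2 means ControlNet touches only a handful of early steps. In the examples above, that was still enough to set the pose.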

Control Mode

Balanced: ControlNet is applied to both conditioning and unconditioning in a sampling step. This is the standard mode of operation.

My prompt is more important: The effect of ControlNet is gradually reduced over the instances of U-Net injection (there are 13 of them in one sampling step). The net effect is that your prompt has more influence than ControlNet.

ControlNet is more important: Turn off ControlNet on unconditioning. Effectively, the CFG scale also acts as a multiplier for the effect of the ControlNet.

Don’t worry if you don’t fully understand how they actually work. The option labels accurately state the effect.
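If you are curious why the CFG scale acts as a multiplier in the last mode, a toy calculation shows it. The sketch below is a deliberately crude model: every prediction is a single number, and ControlNet's contribution is an additive shift. In reality, ControlNet feeds residuals into the U-Net rather than adding to its output, so treat this as intuition only. The mode names and function names are made up.

```python
def cfg_combine(uncond, cond, cfg_scale):
    """Classifier-free guidance: move from uncond toward cond by cfg_scale."""
    return uncond + cfg_scale * (cond - uncond)


def guided_prediction(uncond, cond, control, cfg_scale, mode):
    """Toy scalar model: ControlNet's contribution is an additive shift."""
    if mode == "balanced":
        # Control applied to both the conditioned and unconditioned predictions.
        return cfg_combine(uncond + control, cond + control, cfg_scale)
    if mode == "controlnet_more_important":
        # Control applied to the conditioned prediction only.
        return cfg_combine(uncond, cond + control, cfg_scale)
    raise ValueError(f"unknown mode: {mode}")


base = cfg_combine(0.0, 1.0, 7.0)
balanced = guided_prediction(0.0, 1.0, 0.5, 7.0, "balanced")
stronger = guided_prediction(0.0, 1.0, 0.5, 7.0, "controlnet_more_important")
print(balanced - base)  # shift caused by the control in Balanced mode
print(stronger - base)  # shift caused by the control when it skips unconditioning
```

In this toy model, the control shifts the result by its own size in Balanced mode, because the shift cancels inside the difference between the two predictions. When the control is applied to the conditioned side only, the shift is multiplied by the CFG scale.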

Resize mode

Resize mode controls what to do when the size of the input image or control map is different from the size of the images to be generated. You don’t need to worry about these options if they are in the same aspect ratio.

I will demonstrate the effect of resize modes by setting text-to-image to generate a landscape image, while the input image/control map is a portrait image.

Just Resize: Scale the width and height of the control map independently to fit the image canvas. This will change the aspect ratio of the control map.

The girl now needs to lean forward so that she’s still within the canvas. You can create some interesting effects with this mode.

Crop and Resize: Fit the image canvas within the control map. Crop the control map so that it is the same size as the canvas.

Because the control map is cropped at the top and the bottom, so is our girl.

Resize and Fill: Fit the whole control map within the image canvas. Extend the control map with empty values so that it is the same size as the image canvas.

Compared to the original input image, there is more space on the sides.
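The three resize modes boil down to different choices of scale factor. Here is a minimal sketch of the arithmetic, using a portrait control map and a landscape canvas like in the demonstration above. The mode strings and the function name resize_plan are hypothetical, and the sizes are example values.

```python
def resize_plan(map_size, canvas_size, mode):
    """Return (scaled_width, scaled_height) of the control map for a resize mode.

    map_size and canvas_size are (width, height) tuples.
    """
    map_w, map_h = map_size
    canvas_w, canvas_h = canvas_size
    if mode == "just_resize":
        # Stretch each axis independently; the aspect ratio may change.
        return canvas_w, canvas_h
    scale_w, scale_h = canvas_w / map_w, canvas_h / map_h
    if mode == "crop_and_resize":
        scale = max(scale_w, scale_h)  # cover the canvas, then crop the overflow
    elif mode == "resize_and_fill":
        scale = min(scale_w, scale_h)  # fit inside the canvas, then pad the rest
    else:
        raise ValueError(f"unknown mode: {mode}")
    return round(map_w * scale), round(map_h * scale)


portrait_map, landscape_canvas = (512, 768), (768, 512)
for mode in ("just_resize", "crop_and_resize", "resize_and_fill"):
    print(mode, resize_plan(portrait_map, landscape_canvas, mode))
```

With Crop and Resize, the scaled map is taller than the canvas, so the top and bottom are cut off. With Resize and Fill, it is narrower than the canvas, so the sides are padded.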

OK, now (hopefully) you know all the settings. Let’s explore some ideas for using ControlNet.

Multiple ControlNets

You can use multiple ControlNets to generate an image. Let’s walk through an example.

Model: Protogen v2.2

Prompt:

An astronaut sitting, alien planet

Negative prompt:

disfigured, deformed, ugly

This prompt generates images with a variety of compositions.

Let’s say I want to control the composition of the astronaut and background independently. We can use multiple (in this case 2) ControlNets for this.

I will use this reference image for fixing the pose of the astronaut.

Settings for ControlNet 0:

  • Enable: Yes
  • Preprocessor: OpenPose
  • Model: control_xxxx_openpose
  • Resize mode: Resize and Fill (since my original reference image is in portrait)

I will use the following reference image for the background.

The depth models are perfect for this purpose. You will want to play with which depth model and setting gives the depth map you want.

Settings for ControlNet 1:

  • Enable: Yes
  • Control Weight: 0.45
  • Preprocessor: depth_zoe
  • Model: control_XXXX_depth
  • Resize mode: Crop and Resize

Now I can control the composition of the subject and the background independently:

Tips:

  • Adjust ControlNet weights if one of them does not do its job.
  • Pay attention to the resize mode if your reference images differ in size from the final image.
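
If you drive AUTOMATIC1111 through its web API instead of the GUI, the two-unit setup above might be expressed roughly as follows. Treat this as a sketch under assumptions: the field names (enabled, module, model, weight, resize_mode, image) and the alwayson_scripts layout are modeled on the ControlNet extension's API and should be checked against your installed version. The helper functions are hypothetical, and the model names are the placeholders used in this guide, not real file names.

```python
def controlnet_unit(module, model, resize_mode, weight=1.0, image_b64=None):
    """Describe one ControlNet unit as a dictionary.

    Field names are assumptions based on the ControlNet extension's API.
    """
    return {
        "enabled": True,
        "module": module,            # preprocessor
        "model": model,              # ControlNet model (placeholder name)
        "weight": weight,            # Control Weight
        "resize_mode": resize_mode,
        "image": image_b64,          # base64-encoded reference image, if any
    }


def build_txt2img_payload(prompt, negative_prompt, units):
    """Bundle the prompts and ControlNet units into one request body."""
    return {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "alwayson_scripts": {"controlnet": {"args": list(units)}},
    }


units = [
    # ControlNet 0: fixes the pose of the astronaut.
    controlnet_unit("openpose", "control_xxxx_openpose", "Resize and Fill"),
    # ControlNet 1: fixes the background, with a reduced weight.
    controlnet_unit("depth_zoe", "control_XXXX_depth", "Crop and Resize", weight=0.45),
]
payload = build_txt2img_payload(
    "An astronaut sitting, alien planet",
    "disfigured, deformed, ugly",
    units,
)
```

The sketch only builds the request body. Sending it, and encoding the two reference images, is left out on purpose.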

Some ideas for using ControlNet

Copying human pose

Perhaps the most common application of ControlNet is copying human poses. This is because it is usually hard to control poses… until now! The input image can be an image generated by Stable Diffusion or can be taken from a real camera.

OpenPose model

To use ControlNet for transferring human poses, follow the instructions to enable ControlNet in AUTOMATIC1111. Use the following settings.

Preprocessor: openpose

Model: control_…._openpose

Make sure you have checked Enable.

Here are a few examples.

Example 1: Copying pose from an image

As a basic example, let’s copy the pose of the following image of a woman admiring leaves.

Input image

Using various models and prompts, you can dramatically change the content but keep the pose.

Example 2: Remix a movie scene

You can recast the iconic dance scene in Pulp Fiction as yoga exercises in the park.

This uses ControlNet with the DreamShaper model.

Prompt: photo of women doing yoga, outside in a park. Negative prompt: disfigured, ugly, bad, immature

This is with the same prompt, but using the Inkpunk Diffusion model. (You will need to add the activation keyword nvinkpunk to the prompt.)

Same prompt with the Inkpunk Diffusion model.

Stylize images with ControlNet

Using prompts

The images below were generated with the v1.5 model, using various prompts to achieve different styles. ControlNet was used with various preprocessors. It is best to experiment and see which one works best.

Using models

You can also use models to stylize images. The images below were generated using the prompt “Painting of Beethoven” with the Anythingv3, DreamShaper, and OpenJourney models.

Controlling poses with Magic Poser

Sometimes you may be unable to find an image with the exact pose you want. You can create your custom pose using software tools like Magic Poser (credit).

Step 1: Go to the Magic Poser website.

Step 2: Move the keypoints of the model to customize the pose.

Step 3: Press Preview. Take a screenshot of the model. You should get an image like the one below.

Human pose from Magic Poser.

Step 4: Use the OpenPose ControlNet model. Select the model and prompt of your choice to generate images.

Below are some images generated using the v1.5 model and the DreamShaper model. The pose was copied well in all cases.

Interior design ideas

You can use Stable Diffusion ControlNet’s straight-line detector MLSD model to generate interior design ideas. Below are the ControlNet settings.

Preprocessor: mlsd

Model: mlsd

Start with any interior design photo. Let’s use the one below as an example.

Input image for interior design.

Prompt:

award winning living room

Model: Stable Diffusion v1.5

Below are a few design ideas generated.

Alternatively, you can use the depth model. Instead of straight lines, it will emphasize preserving the depth information.

Preprocessor: Depth Midas

Model: Depth

Generated images:

Difference between the Stable Diffusion depth model and ControlNet

Stability AI, the creator of Stable Diffusion, released a depth-to-image model. It shares a lot of similarities with ControlNet, but there are important differences.

Let’s first talk about what’s similar.

  1. They are both Stable Diffusion models…
  2. They both use two conditionings (a preprocessed image and text prompt).
  3. They both use MiDaS to estimate the depth map.

The differences are

  1. The depth-to-image model is a v2 model. ControlNet can be used with any v1 or v2 model. This point is huge because v2 models are notoriously hard to use, and people have a hard time generating good images with them. The fact that ControlNet can use any v1 model opened up depth conditioning not only to the v1.5 base model, but also to the thousands of special models released by the community.
  2. ControlNet is more versatile. In addition to depth, it can also condition with edge detection, pose detection, and so on.
  3. ControlNet’s depth map has a higher resolution than depth-to-image’s.

How does ControlNet work?

This tutorial wouldn’t be complete without explaining how ControlNet works under the hood.

ControlNet works by attaching trainable network modules to various parts of the U-Net (noise predictor) of the Stable Diffusion model. The weights of the Stable Diffusion model are locked so that they are unchanged during training. Only the attached modules are modified during training.

The model diagram from the research paper sums it up well. Initially, the layers that connect the attached module to the locked model have all-zero weights, so the attached module contributes nothing at first and the new model starts out behaving exactly like the trained and locked model.

During training, two conditionings are supplied along with each training image: (1) the text prompt, and (2) the control map, such as OpenPose keypoints or Canny edges. The ControlNet model learns to generate images based on these two inputs.

Each control method is trained independently.
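
To make the zero-initialization idea concrete, here is a toy sketch in plain Python. It is a simplification under stated assumptions, not the real architecture: tiny linear layers stand in for the U-Net blocks and for the zero-initialized connection layers described in the research paper, and every name and number in it (LOCKED_W, controlled_block, and so on) is made up for illustration.

```python
def linear(weights, bias, x):
    """Apply y = W x + b to a list of floats."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def zero_layer(dim):
    """A connection layer whose weights and bias all start at zero."""
    return [[0.0] * dim for _ in range(dim)], [0.0] * dim


# Locked block: stands in for a pretrained Stable Diffusion block (never updated).
LOCKED_W = [[0.5, -1.0], [2.0, 0.25]]
LOCKED_B = [0.1, -0.2]


def controlled_block(x, control, copy_w, copy_b, zero_in, zero_out):
    """Locked block plus an attached trainable branch fed by the control map."""
    locked_out = linear(LOCKED_W, LOCKED_B, x)
    # The control map enters the trainable branch through a zero-initialized layer.
    injected = [xi + ci for xi, ci in zip(x, linear(*zero_in, control))]
    branch_out = linear(copy_w, copy_b, injected)
    # The branch output is added back through another zero-initialized layer.
    residual = linear(*zero_out, branch_out)
    return [a + b for a, b in zip(locked_out, residual)]


x = [1.0, 2.0]         # toy input features
control = [0.3, 0.7]   # toy control map
copy_w = [row[:] for row in LOCKED_W]  # trainable branch starts as a copy
copy_b = LOCKED_B[:]

out = controlled_block(x, control, copy_w, copy_b, zero_layer(2), zero_layer(2))
print(out == linear(LOCKED_W, LOCKED_B, x))  # True: untrained branch changes nothing
```

Before any training, the output matches the locked block exactly, so the combined model cannot be worse than the original at the start. Training then moves the connection layers away from zero, and the control map gradually gains influence.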

More readings
