Running Stable Diffusion 3 Locally: A Comprehensive Guide

You can now run the Stable Diffusion 3 Medium model locally on your machine. As of the time of writing, you can use ComfyUI to run SD 3 Medium.

Here’s the video version of this tutorial.

Software

We will use ComfyUI, an alternative to AUTOMATIC1111.

Read the ComfyUI installation guide and ComfyUI beginner’s guide if you are new to ComfyUI.

Take the ComfyUI course to learn ComfyUI step-by-step.

System requirements

You need a GPU with 12 GB of VRAM to run the full SD3 Medium model. A smaller variant of SD3 Medium (without T5XXL) requires 8 GB of VRAM.
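
As a rough sketch of the thresholds above, the following Python maps a card's VRAM to the variant it should be able to run. The function names and return labels are made up for illustration. The optional detection step assumes PyTorch is importable and is skipped otherwise.

```python
from typing import Optional


def pick_sd3_variant(vram_gb: float) -> str:
    """Map nominal VRAM (in GB) to the SD3 Medium variant quoted in this guide."""
    if vram_gb >= 12:
        return "full model (with T5XXL)"
    if vram_gb >= 8:
        return "model without T5XXL"
    return "below the stated requirements"


def detect_vram_gb() -> Optional[int]:
    """Return the first CUDA GPU's VRAM rounded to whole GB, or None if unknown."""
    try:
        import torch  # assumption: PyTorch is importable in this environment
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    total_bytes = torch.cuda.get_device_properties(0).total_memory
    # Cards report slightly less than their nominal size, so round to whole GB.
    return round(total_bytes / 1024**3)


if __name__ == "__main__":
    vram = detect_vram_gb()
    if vram is None:
        print("Could not detect a CUDA GPU; check your card's VRAM manually.")
    else:
        print(f"{vram} GB VRAM: {pick_sd3_variant(vram)}")
```

The 12 GB and 8 GB cut-offs are the figures stated in this section, not measured values, so treat the result as a first guess.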

Model

The Stable Diffusion 3 Medium model is the so-called 2B model (i.e., 2 billion parameters).

The SD 3 Medium model is different from the model accessible through the Stable Diffusion 3 API, which is likely to be the 8B Large model. The SD3 Medium model described in this article is less capable.

See an overview of Stable Diffusion 3 if you are unfamiliar with the model.

License concern

The SD3 model is free for non-commercial use. There are some concerns about what constitutes commercial use. Does using an image generated with the SD3 model in a book you sell count as commercial use?

Disclaimer: I’m not a lawyer. Below is my read of the license file.

The license limits the usage of the model and its “derivative works”.

You may not use the Software Products or Derivative Works to enable third parties to use the Software Products or Derivative Works as part of your hosted service or via your APIs,

So, what are the derivative works? The license clarifies that it does NOT include the images generated by the model.

For clarity, Derivative Works do not include the output of any Model.

In the context of Stable Diffusion, Derivative works mean fine-tuned models.

The license explicitly forbids you from hosting an image-generation service without obtaining a commercial license from Stability. It should be OK to use the images generated by the model in any way you want as long as you comply with their Acceptable Use Policy.

Stability’s subscription page is less clear. The name Creator License seems to suggest that you should buy the license ($20 per month) if you are an artist or a social media creator.

Stability should clarify the license issue ASAP given that it relies so much on the user community for its success.

ComfyUI

Step 1: Update ComfyUI

The easiest way to update ComfyUI is to use ComfyUI Manager.

Select Manager > Update ComfyUI.

Step 2: Download SD3 model

Download the SD3 model.

Put it in ComfyUI > models > checkpoints.
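
If you prefer to script this step, here is a minimal sketch that moves a downloaded file into the checkpoints folder. The helper name `install_checkpoint`, the download location, and the `.safetensors` extension in the example comment are assumptions; adjust them to your setup.

```python
import shutil
from pathlib import Path


def install_checkpoint(downloaded: Path, comfyui_root: Path) -> Path:
    """Move a downloaded model file into <comfyui_root>/models/checkpoints."""
    if not downloaded.is_file():
        raise FileNotFoundError(f"Model file not found: {downloaded}")
    target_dir = comfyui_root / "models" / "checkpoints"
    # Create the folder if it is missing (it normally exists already).
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / downloaded.name
    shutil.move(str(downloaded), str(target))
    return target


# Example call (hypothetical paths and file name):
# install_checkpoint(
#     Path.home() / "Downloads" / "stableDiffusion3SD3_sd3MediumInclT5XXL.safetensors",
#     Path("ComfyUI"),
# )
```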

Step 3: Load the workflow

Download the workflow JSON file below and drop it in ComfyUI.

Step 4: Select a model and generate an image

In the LoadCheckpoint node, select one of the following:

  • stableDiffusion3SD3_sd3MediumInclT5XXL for the full model. (12 GB VRAM)
  • stableDiffusion3SD3_sd3MediumInclClips for the model without T5XXL. (8 GB VRAM)

Click Queue Prompt to generate an image.
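
Queuing can also be done from a script instead of the button. The sketch below posts a workflow to a locally running ComfyUI server, following the pattern in ComfyUI's own API script example. It assumes the server listens on the default address `127.0.0.1:8188` and that the workflow was exported in API format, which is not the same JSON layout as the file you drop into the UI. The function names are illustrative.

```python
import json
from urllib import request

# Assumption: ComfyUI is running locally on its default port.
COMFYUI_PROMPT_URL = "http://127.0.0.1:8188/prompt"


def build_queue_request(workflow: dict, url: str = COMFYUI_PROMPT_URL) -> request.Request:
    """Build the POST request that asks ComfyUI to queue a workflow."""
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    return request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )


def queue_workflow(workflow: dict, url: str = COMFYUI_PROMPT_URL) -> dict:
    """Send the workflow to the server and return its JSON reply."""
    with request.urlopen(build_queue_request(workflow, url)) as response:
        return json.loads(response.read())


# Example call (hypothetical file name; requires a running server):
# with open("sd3_workflow_api.json") as f:
#     print(queue_workflow(json.load(f)))
```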

Image sizes

Here is a list of (approximate) aspect ratios and the corresponding image sizes:
1:1 – 1024 x 1024
5:4 – 1152 x 896
3:2 – 1216 x 832
16:9 – 1344 x 768
21:9 – 1536 x 640
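
To see how the sizes above relate to each other, this short sketch computes the exact reduced ratio, the pixel count, and whether both sides are multiples of 64 for each entry. The `describe` helper is illustrative and uses only the numbers from the list.

```python
from fractions import Fraction

# (label, width, height) exactly as listed above.
SIZES = [
    ("1:1", 1024, 1024),
    ("5:4", 1152, 896),
    ("3:2", 1216, 832),
    ("16:9", 1344, 768),
    ("21:9", 1536, 640),
]


def describe(width: int, height: int) -> dict:
    """Return the exact aspect ratio, megapixel count, and 64-pixel alignment."""
    return {
        "exact_ratio": Fraction(width, height),
        "megapixels": width * height / 1_000_000,
        "multiple_of_64": width % 64 == 0 and height % 64 == 0,
    }


if __name__ == "__main__":
    for label, width, height in SIZES:
        info = describe(width, height)
        ratio = info["exact_ratio"]
        print(
            f"{label:>5}  {width} x {height}  "
            f"exact {ratio.numerator}:{ratio.denominator}  "
            f"{info['megapixels']:.2f} MP  64-aligned: {info['multiple_of_64']}"
        )
```

The labels are nominal: 1152 x 896, for example, reduces to 9:7 rather than exactly 5:4. Every size stays close to one megapixel, and both sides are multiples of 64.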

Comparisons

Here’s a first look at the models’ performance.

Text generation

Generating legible text is a big improvement in the Stable Diffusion 3 API model. Let’s see if the locally-run SD 3 Medium performs equally well.

Prompt:

The words “Stable Diffusion 3 Medium” made with fire and lava. dimly lit background with rocks

Negative Prompt:

disfigured, deformed, ugly

Stable Diffusion 3 Medium:

Stable Diffusion 3 Medium without T5XXL:

Stable Diffusion 3 API:

Unfortunately, the SD 3 Medium model did not generate text as well as the Stable Diffusion 3 API model, which is likely the Large 8B model.

Controlling poses

Stable Diffusion 3 Medium has issues with human anatomy. See the following comparison between SD3 Medium, SDXL, and SD3 API (Large).

Prompt:

Photo of a woman sitting on a chair with both hands above her head, white background

Negative prompt:

disfigured, deformed, ugly, detailed face

Stable Diffusion 3 Medium:

Stable Diffusion 3 Medium without T5XXL:

Below are the images from the SDXL model.

Stable Diffusion 3 API (Large):

Overall, Stable Diffusion 3 Medium is worse than SDXL at generating correct human poses.

However, the SD3 Medium model is not too bad at generating fingers! This is a nice surprise.

Photo of a woman showing her palm, new york city background

Prompt adherence

Let’s test if the model can accurately follow the prompt. I will use the following prompt.

Still life painting of a skull above a book, with an orange on the right and an apple on the left

SD3 Medium: 2 out of 3 are correct

SD3 Medium (without T5XXL): 1 out of 3 is correct

SDXL: None is correct

Stable Diffusion 3 API (Large): All are correct

Here’s another example of excellent prompt adherence.

a man and woman are standing together gains a brick wall. The left side of the brick wall is red, right side is gold. the woman is wearing a t-shirt with a panda motif, she has a long skirt with birds on it, the man is wearing a silver suit, he has spiky red hair

I’m pleasantly surprised that SD 3 Medium follows the prompt well and outperforms SDXL. At least something is moving in the right direction!

Conclusion

SD 3 Medium excels at following the prompt closely, which is a big improvement over the SDXL model. While its text generation and human anatomy are a bit disappointing, these defects can likely be corrected by further fine-tuning or by using the SD 3 Large model.
