You can now run the Stable Diffusion 3 Medium model locally on your machine. As of the time of writing, you can use ComfyUI to run SD 3 Medium.
Here’s the video version of this tutorial.
Software
We will use ComfyUI, an alternative to AUTOMATIC1111.
Read the ComfyUI installation guide and ComfyUI beginner’s guide if you are new to ComfyUI.
Take the ComfyUI course to learn ComfyUI step-by-step.
System requirements
You need a GPU with 12 GB of VRAM to use the full SD3 Medium model. A smaller variant of SD3 Medium (without T5XXL) requires 8 GB of VRAM.
Model
The Stable Diffusion 3 Medium model is the so-called 2B model (i.e., 2 billion parameters).
The SD 3 Medium model is different from the model accessible through the Stable Diffusion 3 API, which is likely the 8B Large model. The SD3 Medium model described in this article is less capable.
See an overview of Stable Diffusion 3 if you are unfamiliar with the model.
License concern
The SD3 model is free for non-commercial use. There are some concerns about what constitutes commercial use. Does using an image generated with the SD3 model in a book you sell count as commercial use?
Disclaimer: I’m not a lawyer. Below is my read of the license file.
The license limits the usage of the model and its “derivative works”.
You may not use the Software Products or Derivative Works to enable third parties to use the Software Products or Derivative Works as part of your hosted service or via your APIs,
So, what are the derivative works? The license clarifies that it does NOT include the images generated by the model.
For clarity, Derivative Works do not include the output of any Model.
In the context of Stable Diffusion, derivative works are fine-tuned models.
The license explicitly forbids you from hosting an image-generation service without obtaining a commercial license from them. It should be OK to use the images generated by the model in any way you want as long as you comply with their Acceptable Use Policy.
Stability’s subscription page is less clear. The name Creator License seems to suggest that you should buy the license ($20 per month) if you are an artist or a social media creator.
Stability should clarify the license issue ASAP given that it relies so much on the user community for its success.
ComfyUI
Step 1: Update ComfyUI
The easiest way to update ComfyUI is to use ComfyUI Manager.
Select Manager > Update ComfyUI.
Step 2: Download SD3 model
Download the SD3 model.
- SD 3 Medium (10.1 GB) (12 GB VRAM) (Alternative download link)
- SD 3 Medium without T5XXL (5.6 GB) (8 GB VRAM) (Alternative download link)
Put it in ComfyUI > models > checkpoints.
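If you prefer to script this step, here is a minimal Python sketch that moves a downloaded checkpoint into the ComfyUI > models > checkpoints folder. The function name `install_checkpoint` and the example paths in the comments are assumptions for illustration; adjust them to wherever your download and your ComfyUI folder actually live.

```python
from pathlib import Path
import shutil


def install_checkpoint(downloaded_file: Path, comfyui_root: Path) -> Path:
    """Move a downloaded checkpoint into <comfyui_root>/models/checkpoints.

    Returns the final location of the file. If a file with the same name
    is already there, nothing is moved and the existing path is returned.
    """
    checkpoints_dir = comfyui_root / "models" / "checkpoints"
    checkpoints_dir.mkdir(parents=True, exist_ok=True)

    target = checkpoints_dir / downloaded_file.name
    if target.exists():
        return target  # already installed; leave both files untouched

    shutil.move(str(downloaded_file), str(target))
    return target


# Example usage (hypothetical paths, replace with your own):
# install_checkpoint(
#     Path.home() / "Downloads" / "your_sd3_file.safetensors",
#     Path.home() / "ComfyUI",
# )
```

Restart or refresh ComfyUI afterwards so the new file shows up in the checkpoint list.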
Step 3: Load the workflow
Download the workflow JSON file below and drop it in ComfyUI.
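Before dropping a downloaded workflow into ComfyUI, you may want to peek at what it contains. The sketch below counts the node types in a workflow file. It assumes the two JSON layouts that ComfyUI exports as far as I know: the regular save format with a top-level `nodes` list whose entries carry a `type`, and the API format where each entry carries a `class_type`. The function name `summarize_workflow` and the file name in the usage comment are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_workflow(path) -> Counter:
    """Count node types in a ComfyUI workflow JSON file.

    Assumed layouts (not verified against every ComfyUI version):
      - UI format:  {"nodes": [{"type": "..."}, ...], ...}
      - API format: {"<node id>": {"class_type": "...", ...}, ...}
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError("Unexpected workflow layout: top level is not an object")

    if isinstance(data.get("nodes"), list):
        # UI format: one entry per node, node type stored under "type".
        types = [n.get("type", "?") for n in data["nodes"] if isinstance(n, dict)]
    else:
        # API format: node type stored under "class_type".
        types = [
            v["class_type"]
            for v in data.values()
            if isinstance(v, dict) and "class_type" in v
        ]
    return Counter(types)


# Example usage (hypothetical file name):
# for node_type, count in summarize_workflow("sd3_workflow.json").items():
#     print(f"{count} x {node_type}")
```

If the summary lists a node type that your ComfyUI installation does not know, that is usually a sign you need to update ComfyUI (Step 1) before loading the workflow.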
Step 4: Select a model and generate an image
In the Load Checkpoint node, select:
- stableDiffusion3SD3_sd3MediumInclT5XXL for the full model. (12 GB VRAM)
- stableDiffusion3SD3_sd3MediumInclClips for the model without T5XXL. (8 GB VRAM)
Click Queue Prompt to generate an image.
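The choice between the two checkpoints comes down to how much VRAM your GPU has. Here is a small sketch of that rule, using the 12 GB and 8 GB figures and the checkpoint names from this article. The function name `pick_sd3_checkpoint` is made up for illustration, and the input is the nominal VRAM size of your card as advertised.

```python
FULL_MODEL = "stableDiffusion3SD3_sd3MediumInclT5XXL"      # needs 12 GB VRAM
NO_T5XXL_MODEL = "stableDiffusion3SD3_sd3MediumInclClips"  # needs 8 GB VRAM


def pick_sd3_checkpoint(vram_gb: float) -> str:
    """Return the checkpoint name to select for a given nominal VRAM size.

    The thresholds are the requirements stated in this article, not
    measured limits.
    """
    if vram_gb >= 12:
        return FULL_MODEL
    if vram_gb >= 8:
        return NO_T5XXL_MODEL
    raise ValueError(
        f"{vram_gb} GB VRAM is below the 8 GB stated for SD3 Medium without T5XXL"
    )


# Example usage:
# print(pick_sd3_checkpoint(12))  # full model
# print(pick_sd3_checkpoint(8))   # model without T5XXL
```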
Image sizes
Here is a list of aspect ratios (approximate) and the corresponding image sizes:
1:1 – 1024 x 1024
5:4 – 1152 x 896
3:2 – 1216 x 832
16:9 – 1344 x 768
21:9 – 1536 x 640
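Note that the ratio labels are nominal: 1152 x 896, for example, reduces to exactly 9:7 rather than 5:4. The short sketch below works out the exact ratio and pixel count of each size in the list, and checks whether both sides are multiples of 64. The names `SD3_SIZES` and `describe_size` are illustrative only.

```python
from fractions import Fraction

# Nominal aspect ratio label -> (width, height), taken from the list above.
SD3_SIZES = {
    "1:1": (1024, 1024),
    "5:4": (1152, 896),
    "3:2": (1216, 832),
    "16:9": (1344, 768),
    "21:9": (1536, 640),
}


def describe_size(width: int, height: int) -> dict:
    """Return the exact reduced ratio, decimal ratio, and pixel count."""
    exact = Fraction(width, height)  # Fraction reduces to lowest terms
    return {
        "exact_ratio": f"{exact.numerator}:{exact.denominator}",
        "decimal": round(width / height, 3),
        "pixels": width * height,
        "multiple_of_64": width % 64 == 0 and height % 64 == 0,
    }


for label, (w, h) in SD3_SIZES.items():
    info = describe_size(w, h)
    print(
        f"{label:>5}  {w} x {h}  exact {info['exact_ratio']:>5}  "
        f"ratio {info['decimal']}  {info['pixels']} px  "
        f"multiple of 64: {info['multiple_of_64']}"
    )
```

All five sizes stay close to the pixel count of 1024 x 1024, so they trade width for height at a roughly constant image area.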
Comparisons
Here’s a first look at the models’ performance.
Text generation
Generating legible text is a big improvement in the Stable Diffusion 3 API model. Let’s see if the locally-run SD 3 Medium performs equally well.
Prompt:
The words “Stable Diffusion 3 Medium” made with fire and lava. dimly lit background with rocks
Negative Prompt:
disfigured, deformed, ugly
Stable Diffusion 3 Medium:
Stable Diffusion 3 Medium without T5XXL:
Stable Diffusion 3 API:
Unfortunately, the SD 3 Medium model did not generate text as well as the Stable Diffusion 3 API model, which is likely the Large 8B model.
Controlling poses
Stable Diffusion 3 Medium has issues with human anatomy. See the following comparison between SD3 Medium, SDXL, and SD3 API (Large).
Prompt:
Photo of a woman sitting on a chair with both hands above her head, white background
Negative prompt:
disfigured, deformed, ugly, detailed face
Stable Diffusion 3 Medium:
Stable Diffusion 3 Medium without T5XXL:
Below are the images from the SDXL model.
Stable Diffusion 3 API (Large):
Overall, Stable Diffusion 3 Medium is worse than SDXL at generating correct human poses.
However, the SD3 Medium model is not too bad at generating fingers! This is a nice surprise.
Photo of a woman showing her palm, new york city background
Prompt adherence
Let’s test if the model can accurately follow the prompt. I will use the following prompt.
Still life painting of a skull above a book, with an orange on the right and an apple on the left
SD3 Medium: 2 out of 3 are correct
SD3 Medium (without T5XXL): 1 out of 3 is correct
SDXL: None are correct
Stable Diffusion 3 API (Large): All are correct
Here’s another example of excellent prompt adherence.
a man and woman are standing together gains a brick wall. The left side of the brick wall is red, right side is gold. the woman is wearing a t-shirt with a panda motif, she has a long skirt with birds on it, the man is wearing a silver suit, he has spiky red hair
I’m pleasantly surprised that SD 3 Medium follows the prompt well and outperforms SDXL. At least something is moving in the right direction!
Conclusion
SD 3 Medium excels at following the prompt closely, which is a big improvement over the SDXL model. Its text rendering and human anatomy are a bit disappointing, but these defects can likely be corrected by further fine-tuning and the use of the SD 3 Large model.