Stable Diffusion 3: What to Expect

Stable Diffusion 3 has been announced, followed by a research paper detailing the model. The model is not publicly available yet, but you can join the waitlist to get in line for an early preview. I will add a usage guide to this article once the model is available.

In this article, we will take a closer look at:

  • What is Stable Diffusion 3
  • The improvements in Stable Diffusion 3
  • Model changes
  • Sample images

What is the Stable Diffusion 3 model?

Stable Diffusion 3 is the latest generation of text-to-image AI models to be released by Stability AI.

It is not a single model but a family of models ranging from 800M to 8B parameters. In other words, the smallest model is a bit smaller than Stable Diffusion 1.5 (1B), and the largest model is a bit bigger than the Stable Diffusion XL model (6.6B for base + refiner).

The product design follows the industry trend of large language models, where Google, Meta, and Anthropic have released foundation models of different sizes for different use cases.

Stable Diffusion 3 allows commercial use, but it is not free. You must join their membership program for a fee (currently quite modest for small businesses).

Improvements

So what’s the big deal with Stable Diffusion 3? Here are some expected improvements.

Better text generation

Text rendering has long been a weakness for Stable Diffusion. Stable Diffusion 1.5 is really bad at it.

We see significant improvements in the Stable Diffusion XL and Stable Cascade models, but text rendering is still hit or miss and needs significant cherry-picking. And it is hopeless for long sentences.

See what SD 3 can generate:

Stable Diffusion 3 image (Stability AI)

From the sample images, we see some impressive long sentences with good font styles. I definitely think it is an improvement because we could never generate those images with SDXL or Stable Cascade.

Better prompt following

An outstanding issue with SDXL and Stable Cascade is that they do not follow the prompts as well as DALLE 3. One innovation of DALLE 3 is using highly accurate image captions in training to learn to follow prompts well.

I previously speculated that a new version of Stable Diffusion could use the same method to improve the model. Lo and behold, they did that in SD 3.

Stable Diffusion 3 is on par with DALLE 3 on prompt-following in user studies. (Stability AI)

Stable Diffusion 3 should be at least as good as DALLE 3 at following prompts. This is going to be exciting.

Speed and deployment

You will be able to run the largest SD3 model locally if you have a video card with 24 GB of VRAM. The requirement will likely come down after release, when people start doing all kinds of optimizations for consumer PCs.

The initial benchmark is 34 seconds for a 1024×1024 image on an RTX 4090 video card (50 steps). There should be a lot of room for improvement.
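As a quick back-of-the-envelope check (assuming the 34 seconds is end-to-end generation time), that works out to roughly 0.7 seconds per sampling step:

```python
# Rough per-step cost implied by the quoted benchmark (assuming the 34 s
# covers all 50 sampling steps and little else).
total_seconds, steps = 34, 50
print(f"{total_seconds / steps:.2f} s per step")  # ~0.68 s per step at 1024x1024
```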

Stable Diffusion 3 image (Stability AI)

Safety…

Similar to newer Stable Diffusion models, Stable Diffusion 3 is likely to generate SFW images only.

In addition, artists who did not want their work to be in the model could opt out. While this reduces the styles available for mix-and-match, it should make the model less prone to misuse.

I’m not sure what they are doing about deepfakes. Fictitious images of celebrities arguably do the most harm in spreading misinformation in the age of fast media consumption. DALLE 3 stays out of trouble by not being very good at generating photorealistic images. Stable Diffusion has been good at photorealistic images, and I hope they won’t steer away from that.

Stable Diffusion 3 image (Stability AI)

What’s new in the Stable Diffusion 3 model

Noise predictor

A notable change in Stable Diffusion 3 is the departure from the U-Net noise predictor architecture used in Stable Diffusion 1 and 2.

Stable Diffusion 3 uses a repeating stack of Diffusion Transformers. The benefit is similar to using transformers in large language models: you get a predictable performance improvement as you increase the model size.

A Diffusion Transformer block (image: Stability AI)

The block has an interesting structure that puts the text prompt and the latent image on the same footing. This architecture looks well-positioned to add multimodal conditioning, such as image prompts.
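To make the idea concrete, here is a minimal PyTorch sketch of a joint text-image transformer block in the spirit of the figure above. The layer sizes, names, and the exact way the two token streams are mixed are my assumptions for illustration, not the actual SD3 implementation.

```python
import torch
import torch.nn as nn

class JointTransformerBlock(nn.Module):
    """Toy block: text and image tokens share one self-attention pass."""

    def __init__(self, dim=1024, num_heads=16):
        super().__init__()
        # Each modality keeps its own norms and MLP...
        self.img_norm1, self.txt_norm1 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.img_norm2, self.txt_norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.img_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.txt_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # ...but attention runs over the concatenated sequence, so text and
        # image tokens attend to each other on equal footing.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, img_tokens, txt_tokens):
        x = torch.cat([self.txt_norm1(txt_tokens), self.img_norm1(img_tokens)], dim=1)
        attn_out, _ = self.attn(x, x, x)
        txt_attn, img_attn = attn_out.split([txt_tokens.shape[1], img_tokens.shape[1]], dim=1)
        txt_tokens = txt_tokens + txt_attn
        img_tokens = img_tokens + img_attn
        txt_tokens = txt_tokens + self.txt_mlp(self.txt_norm2(txt_tokens))
        img_tokens = img_tokens + self.img_mlp(self.img_norm2(img_tokens))
        return img_tokens, txt_tokens

# Example: 77 text tokens and a 32x32 latent flattened into 1024 patch tokens.
block = JointTransformerBlock()
img, txt = block(torch.randn(1, 1024, 1024), torch.randn(1, 77, 1024))
```

The key point is that the text is not injected through a separate cross-attention layer as in Stable Diffusion 1 or XL; both streams go through the same attention, which is why adding other modalities later looks straightforward.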

Sampling

The Stability team has spent considerable effort studying sampling to make it fast and high-quality. Stable Diffusion 3 uses Rectified Flow sampling. Basically, it takes a straight path from noise to a clear image, the fastest way to get there.

The team also found that a noise schedule that samples the middle part of the path more often produces higher-quality images.

Well, it looks like sampling is going to be completely different. But from our experience, some existing samplers will likely work well enough.
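For intuition, here is a loose PyTorch sketch of the rectified-flow idea: the clean image and the noise are connected by a straight line, the model learns the constant velocity along that line, and sampling is just walking back along it. The function names, step count, and the logit-normal timestep draw are illustrative assumptions, not SD3's actual code.

```python
import torch

def forward_interpolate(x0, noise, t):
    # Straight path between the clean image x0 (t=0) and pure noise (t=1).
    t = t.view(-1, 1, 1, 1)
    return (1 - t) * x0 + t * noise

def training_target(x0, noise):
    # The velocity along that straight path is constant: noise - x0.
    return noise - x0

def sample_timesteps(batch_size):
    # Mimics "sample the middle of the path more": a logit-normal draw
    # concentrates t around 0.5 (the exact SD3 weighting may differ).
    return torch.sigmoid(torch.randn(batch_size))

@torch.no_grad()
def euler_sample(model, noise, num_steps=28):
    # Walk from t=1 (noise) back to t=0 (image) along the predicted velocity.
    x = noise
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        v = model(x, t)            # model predicts the velocity at (x, t)
        x = x + (t_next - t) * v   # Euler step; t_next < t, so we move toward the data
    return x
```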

Text encoders

Stable Diffusion 1 uses 1 text encoder (CLIP), while Stable Diffusion XL uses 2 encoders (CLIP and OpenCLIP).

Guess what? Stable Diffusion 3 uses 3 encoders!

  • OpenAI’s CLIP L/14
  • OpenCLIP bigG/14
  • T5-v1.1-XXL

The last one is pretty big and can be dropped if you are not generating text.
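Here is a rough sketch of what wiring up three text encoders could look like with the Hugging Face transformers library. The checkpoints I load (OpenAI's CLIP L/14, the bigG encoder that ships with SDXL, and Google's T5 v1.1 XXL) and the way I combine the embeddings are assumptions; the exact weights and mixing scheme SD3 will ship with are not public yet.

```python
import torch
from transformers import (CLIPTextModel, CLIPTextModelWithProjection,
                          CLIPTokenizer, T5EncoderModel, T5Tokenizer)

prompt = "a cat holding a sign that says hello world"

# 1. OpenAI CLIP L/14
tok_l = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
enc_l = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
emb_l = enc_l(**tok_l(prompt, padding="max_length", return_tensors="pt")).last_hidden_state

# 2. OpenCLIP bigG/14 -- reusing the transformers-compatible copy that ships
#    with SDXL (whether SD3 uses these exact weights is an assumption)
tok_g = CLIPTokenizer.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="tokenizer_2")
enc_g = CLIPTextModelWithProjection.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="text_encoder_2")
emb_g = enc_g(**tok_g(prompt, padding="max_length", return_tensors="pt")).last_hidden_state

# 3. T5 v1.1 XXL -- the big one; skip it if you don't need text rendering
tok_t5 = T5Tokenizer.from_pretrained("google/t5-v1_1-xxl")
enc_t5 = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")
emb_t5 = enc_t5(**tok_t5(prompt, return_tensors="pt")).last_hidden_state

# One plausible (assumed) way to combine them: stack the two CLIP features
# side by side and keep the T5 tokens as a separate, longer sequence.
clip_emb = torch.cat([emb_l, emb_g], dim=-1)   # (1, 77, 768 + 1280)
print(clip_emb.shape, emb_t5.shape)
```

Even the encoder half of T5 v1.1 XXL is several gigabytes, which is why dropping it when you are not generating text is attractive.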

Better captions

One thing that DALLE 3 did was use highly accurate captions in training. That’s why it follows the prompt so well.

Stable Diffusion 3 does the same, so you can expect prompt-following on par with DALLE 3.
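The recaptioning idea itself is easy to sketch: run every training image through a strong captioning model and train on the generated caption instead of the original alt-text. The captioner below (BLIP) is just a stand-in for illustration; which captioning model Stability actually used is not covered here.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

def recaption(image_path: str) -> str:
    # Replace the original (often noisy) alt-text with a model-written caption.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    out = captioner.generate(**inputs, max_new_tokens=60)
    return processor.decode(out[0], skip_special_tokens=True)

# Build (image, synthetic caption) training pairs:
# training_pairs = [(p, recaption(p)) for p in image_paths]
```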
