Comparing Stable Diffusion 3 with SDXL and Stable Cascade

Stable Diffusion 3 is the latest and largest image model in the Stable Diffusion series. It promises to outperform previous models like Stable Cascade and Stable Diffusion XL in text generation and prompt following.

In this post, I will compare the Stable Diffusion 3 model with the Stable Cascade and Stable Diffusion XL models from a user’s perspective. The following areas are tested:

  • Text rendering
  • Prompt following
  • Rendering faces
  • Rendering hands
  • Styles

Using Stable Diffusion 3

Stable Diffusion 3 is currently available through Stability’s developer API. You can find instructions to use it in this post.

Read this post for an overview of the Stable Diffusion 3 model.
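If you want to follow along, below is a minimal sketch of generating an image through the API in Python. It assumes your key is stored in the STABILITY_API_KEY environment variable; the endpoint and form fields follow Stability’s v2beta stable-image API at the time of writing, so check the official documentation for current details.

```python
import os

import requests

# Minimal sketch: request one SD3 image from Stability's developer API.
# Assumes STABILITY_API_KEY holds a valid API key.
response = requests.post(
    "https://api.stability.ai/v2beta/stable-image/generate/sd3",
    headers={
        "Authorization": f"Bearer {os.environ['STABILITY_API_KEY']}",
        "Accept": "image/*",  # return the raw image bytes
    },
    files={"none": ""},  # the endpoint expects multipart/form-data
    data={
        "prompt": 'a photo of a cat holding a sign "Hello SD3"',
        "model": "sd3",
        "output_format": "png",
    },
)
response.raise_for_status()

with open("sd3.png", "wb") as f:
    f.write(response.content)
```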

Text rendering

Generating legible text on images has long been challenging for AI image generators. Stable Diffusion 1.5 sucks at it. Stable Diffusion XL is an improvement. Stable Cascade is a quantum leap.

The good news is Stable Diffusion 3’s text generation is at the next level.

Let’s test the three models with the following prompt, which is designed to produce a challenging piece of text.

a portrait photo of a 25-year old beautiful woman, busy street, smiling, holding a sign “Stable Diffusion 3 vs Cascade vs SDXL”

Here are images from the SDXL model. You can see the text generation is far from correct.

Stable Cascade generates clearer text, but the result is still far from satisfactory.

Finally, Stable Diffusion 3 performs the best, making it the strongest text generator in the Stable Diffusion series so far.

Ideas for text generation

What can you do with this new capability? You can now experiment with funny signs.

City street, a sign on a property that says “No trespassing. We are tired of hiding the bodies”

We needed ControlNet to generate text-effect images with Stable Diffusion 1.5 and SDXL. Not anymore with SD3!

The word “Fire” made of fire and lava, high temperature

Or now you can play with longer texts.

The sentence “Let’s snow somewhere else” made of snow, mountain

Prompt following

Prompt following is touted as a major improvement in SD3.

I will challenge the prompt-following ability of the three Stable Diffusion models in two areas:

  • Controlling poses
  • Object composition

Controlling poses

Prompt:

Photo of a woman sitting on a chair with both hands above her head, white background

Negative prompt:

disfigured, deformed, ugly, detailed face
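(If you are reproducing this test through the API, the negative prompt is just another form field in the same request. Below is a hypothetical generate_sd3 helper wrapping the call sketched earlier; the negative_prompt field name follows Stability’s v2beta API.)

```python
import os

import requests

def generate_sd3(prompt: str, negative_prompt: str = "", out_path: str = "out.png") -> None:
    """Hypothetical wrapper around the SD3 endpoint sketched earlier."""
    response = requests.post(
        "https://api.stability.ai/v2beta/stable-image/generate/sd3",
        headers={
            "Authorization": f"Bearer {os.environ['STABILITY_API_KEY']}",
            "Accept": "image/*",
        },
        files={"none": ""},
        data={
            "prompt": prompt,
            "negative_prompt": negative_prompt,  # concepts to steer away from
            "model": "sd3",
            "output_format": "png",
        },
    )
    response.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(response.content)

generate_sd3(
    "Photo of a woman sitting on a chair with both hands above her head, white background",
    negative_prompt="disfigured, deformed, ugly, detailed face",
    out_path="pose_test.png",
)
```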

Below are the images from the SDXL model.

Below are the images from the Stable Cascade model.

And finally, the Stable Diffusion 3 model.

All three models do pretty well with this human pose. Here are some observations.

  • It takes less cherry-picking to get good images from SD3. (It had better, because I am paying for each image generated through the API!)
  • The images from SD3 are strangely uniform – all in all-black outfits. Perhaps the model is a bit overcooked? I’m sure a fine-tuned model can fix that.
  • To SD3’s credit, its images match the requested pose more consistently.

Maybe this pose is not challenging enough? Let’s try another one:

Photo of a boy looking up and raise one arm, stand on one foot, white background

SDXL couldn’t generate the exact pose we asked for, but it’s half correct.

Stable Cascade is more accurate. There is a higher chance that the boy raises one arm.

Stable Diffusion 3 gets the raised arm correct most of the time, although the boy is still not standing on one foot.

Conclusion: In many cases, Stable Diffusion 3’s accuracy with human poses is similar to SDXL’s and Stable Cascade’s. For challenging poses, Stable Diffusion 3 has an edge over the other two.

Object composition

The next test is object composition: how well does the model follow the specified placement of objects?

Prompt:

Still life painting of a skull above a book, with an orange on the right and an apple on the left

SD3 is the clear winner in this test. The composition is correct in every image I generated.

SDXL: (None is correct)

Stable Cascade: (1 out of 3 is correct)

Stable Diffusion 3: (All are correct)

This is an expected improvement, as newer models like DALL·E 3 have used highly accurate captions in training to significantly improve prompt following. According to Stability’s user testing, SD3’s prompt following is on par with DALL·E 3’s.

How can you take advantage of it? In addition to using SD3 directly, you can generate a template image with it and use that template with ControlNet in SD 1.5 and SDXL models, as sketched below.
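Here is a minimal sketch of that workflow with the diffusers library: extract a Canny edge map from an SD3 image (saved here as template.png, a file name I made up) and use it to lock the composition while an SD 1.5 ControlNet pipeline re-renders the scene.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Turn the SD3 composition into a Canny edge map (3 channels for ControlNet).
image = np.array(Image.open("template.png").convert("RGB"))
edges = cv2.Canny(image, 100, 200)
control_image = Image.fromarray(np.concatenate([edges[:, :, None]] * 3, axis=2))

# Standard SD 1.5 Canny ControlNet checkpoints.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# The edge map fixes the skull/book/orange/apple layout; the prompt
# controls the rendering style.
result = pipe(
    "Still life painting of a skull above a book, "
    "with an orange on the right and an apple on the left",
    image=control_image,
    num_inference_steps=30,
).images[0]
result.save("controlnet_out.png")
```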

Hands

Here’s the prompt I used:

photo of open palms, detailed fingers, city street

I don’t see any clear improvement in generating hands. I hope to see a focused effort from Stability to add high-quality photos of hands to the training data.

SDXL:

Stable Cascade:

SD3:

Faces

Generating faces is perhaps the most popular application of AI image generators. Let’s test with the following prompt.

photo of a 20 year old korean k-pop star, beautiful woman, detailed face, eyes, lips, nose, hair, realistic skin tone

I added keywords “3D” and “cartoon” to the negative prompt to enhance the realistic style:

disfigured, deformed, ugly, 3d, cartoon

SDXL:

Stable Cascade:

SD3:

All models performed pretty well with no serious defects. They have different default styles and exposure levels, which you should be able to customize with a more specific prompt and negative prompt.

Styles

AI models can generate a lot more styles than I can test. I will pick a few prompts from the SDXL style reference.
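For context, the style reference is a set of templates: each style pairs a prompt and a negative prompt, with a {prompt} placeholder for your own subject. The sketch below shows how two of the styles tested in this section expand; the apply_style helper is my own illustration, not part of the reference.

```python
# Two style templates, copied from the prompts used in this section.
STYLES = {
    "expressionist": {
        "prompt": "expressionist {prompt}. raw, emotional, dynamic, "
        "distortion for emotional effect, vibrant, use of unusual colors, detailed",
        "negative_prompt": "anime, photorealistic, 35mm film, deformed, "
        "glitch, low contrast, noisy",
    },
    "typographic art": {
        "prompt": "typographic art {prompt} . stylized, intricate, "
        "detailed, artistic, text-based",
        "negative_prompt": "ugly, deformed, noisy, blurry, low contrast, "
        "realism, photorealistic",
    },
}

def apply_style(style_name: str, subject: str) -> tuple[str, str]:
    """Substitute the subject into a style template."""
    style = STYLES[style_name]
    return style["prompt"].format(prompt=subject), style["negative_prompt"]

prompt, negative_prompt = apply_style("expressionist", "woman")
print(prompt)  # the Expressionist prompt shown below
```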

Expressionist

Prompt:

expressionist woman. raw, emotional, dynamic, distortion for emotional effect, vibrant, use of unusual colors, detailed

Negative prompt:

anime, photorealistic, 35mm film, deformed, glitch, low contrast, noisy

SD3 appropriately uses bold colors for this prompt. I would rate its generation as the most faithful to the prompt.

Typography art

Prompt:

typographic art woman . stylized, intricate, detailed, artistic, text-based

Negative prompt:

ugly, deformed, noisy, blurry, low contrast, realism, photorealistic

I’m pleasantly surprised by how well SD3 blends the text and image with pleasing details.

RPG Game

Prompt:

role-playing game (RPG) style fantasy city . detailed, vibrant, immersive, reminiscent of high fantasy RPG games

Negative prompt:

sci-fi, modern, urban, futuristic, low detailed

All models generate reasonable video game images. Stable Cascade’s default style resembles a newer 3D game, while SDXL and SD3 lean toward the look of older video games.

Overall, I really like SD3’s ability to render styles accurately. There are real opportunities to mix and match different styles to generate something new!

Conclusions

Stable Diffusion 3 delivers notable improvements in rendering text and generating images that closely follow the prompt. While SD3 is still imperfect at directing human poses with prompts, it has improved over the previously most competent models.

Face generation is generally on par with previous models, which were already excellent.

Unfortunately, SD3 still has problems generating hands. It doesn’t seem to have improved over the previous models. Hopefully, there will be a more focused effort in this area.

Generating new styles is an exciting opportunity with SD3. With its enhanced text rendering and prompt following, we can more accurately dial in the styles we want.
