Generating Prompts from Images Using Stable Diffusion

Have you seen an AI image you really liked and wondered what the prompt was? We will go through a few ways to get prompts from images. You will also learn additional techniques to increase your chances of recreating the image.

In this article, you will learn how to

  • Read PNG metadata information, where sometimes the prompt is written.
  • Use CLIP interrogator to guess the prompt.
  • Tips for reproducing an image.

Software setup

We will use AUTOMATIC1111 Stable Diffusion WebUI in this tutorial. It is popular and free. You can use this software on WindowsMac, or Google Colab.

Check out the Quick Start Guide if you are new to Stable Diffusion. Check out the AUTOMATIC1111 Guide if you are new to AUTOMATIC1111.

Method 1: Get prompts from images by reading PNG Info

If the AI image is in PNG format, you can try to see if the prompt and other setting information were written in the PNG metadata field.

First, save the image to your local storage.

Open AUTOMATIC1111 WebUI. Navigate to the PNG Info page.

Drag and drop the image to the Source canvas on the left.

You will see the prompt, the negative prompt, and other generation parameters on the right if it is in the image file. You can optionally send the prompt and settings to the txt2img, img2img, inpainting, or the Extras page for upscaling.

Alternatively, you can use this free site to view the PNG metadata without using AUTOMATIC1111.

Method 2: Guess prompts from images with CLIP interrogator

More often than not, the first method does not work. The generation information may not have been written in the first place. It may have been there, but the web server stripped it during image optimization. Or it was not generated by Stable Diffusion.

In this case, your next option is to use a CLIP interrogator. It is a class of AI models that guess the captions of images. It works on any images, not just AI images.

What is CLIP?

CLIP (Contrastive Language–Image Pre-training) is a neural network that maps visual concepts to natural languages. A CLIP model is trained with a massive number of image and caption pairs.

Contrastive Language–Image Pre-training (CLIP) (Image credit: OpenAI.)

Given an image, a CLIP model can infer the caption that describes the image. You use the caption as the prompt in our use case.

WebUI’s native CLIP interrogator

If you don’t want to install any extension, you can use AUTOMATIC1111’s native CLIP interrogator on the img2img page. It uses BLIP, a CLIP model described in the article “BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation” by Junnan Li and coworkers.

To use the native CLIP interrogator:

  1. Open AUTOMATIC11111.

2. Navigate to the img2img page.

3. Upload the image to the img2img canvas.

4. Click Interrogate CLIP to get the prompt.

a woman with a wreath of flowers on her head and necklace on her neck, looking at the camera, Elinor Proby Adams, portrait photography, a character portrait, arts and crafts movement

Testing this prompt with the Realistic Vision model and a negative prompt for realistic people, we get the following images.

We get a woman wearing a a wreath with a necklace, although the composition is different.

CLIP interrogator extension

The native CLIP interrogator of AUTOMATIC1111 does not allow you to use a different CLIP model. You must use the CLIP interrogator extension if you want the extra functionalities.

Follow the instructions on the recommended extensions page to install. Here’s the extension’s URL.

https://github.com/pharmapsychotic/clip-interrogator-ext

To use the CLIP interrogator extension.

  1. Open AUTOMATIC1111 WebUI.

2. Navigate to the Interrogator page.

3. Upload the image to the Image canvas.

4. Select ViT-L-14-336/openai under the CLIP Model dropdown menu. This is the language embedding model used in Stable Diffusion v1.5.

5. Click Generate to produce the prompt.

This is what we got.

there is a woman with a flower crown on her head, with depth of field, earthy tones, marigold, portrait of a cute woman, dryad, subject centered in frame, of a young woman, midsommar, portrait face, 65mm 1.5x anamorphic lens, inspired by Elsa Beskow, art : : professional photograph, druid portrait

We get the following images using the same image settings as in the previous section.

Again, it is close but not quite the same. Her necklace is missing in the prompt and hence in the image. Since the outcome of a CLIP interrogator is quite variable, I won’t say the ViT-L-14-336/openai model is worse than BLIP.

Interrogate CLIP for SDXL model

If the prompt is intended to be used with the Stable Diffusion XL (SDXL) model, you can select ViT-g-14/laion2b_s34b_b88k in the CLIP model dropdown menu on the Interrogate page.

This gives the following prompt.

there is a woman with a flower crown on her head, medium portrait top light, f / 1, extra – details, 1 8 yo, national geographic photo shoot, movie scene portrait closeup, inspired by William Morris, center frame portrait, lut, warm glow, bio-inspired, at home, f / 2 0, by Jane Kelly

These are the images generated with the SDXL 1.0 base + refiner model.

The prompt and model did produce images closer to the original composition.

Tips for reproducing an AI image with Stable Diffusion

You should always try the PNG info method (Method 1) first to get prompts from images because, if you are lucky, it gives you the complete information to recreate the image. This includes the prompt, model, sampling method, sampling steps, etc.

You can experiment with BLIP and the CLIP models for Stable Diffusion v1.5 and XL models. ViT-g-14/laion2b_s34b_b88k could work quite well with an v1.5 model, not just the SDXL.

Don’t hesitate to revise the prompt. As the examples above show, the prompt can be incorrect or missing some objects. Edit the prompt accordingly to describe the image correctly.

Choosing an appropriate checkpoint model is important. The prompt won’t necessarily include the correct style. For example, choose a realistic model if you want to generate realistic people.

Finally, the nuclear option is to use an image prompt. The SD v1.5 Plus model can reproduce an image faithfully with an appropriate prompt.

aizmin: