Stable Diffusion: Converting Text to Video

Text-to-video is the challenging task of turning a text description into a video. Diffusion-based text-to-video model is improving at a rapid speed. Now, these models become usable and can be run locally on your machine. In this post, you will learn a few ways to convert a text prompt to a video.

  1. AnimateDiff
  2. ModelScope
  3. Deforum

Software

We will use AUTOMATIC1111 Stable Diffusion GUI. You can use this GUI on WindowsMac, or Google Colab.

Check out the Quick Start Guide if you are new to Stable Diffusion.

1. AnimateDiff

(See the detailed guide for using AnimateDiff)

AnimateDiff is a text-to-video module for Stable Diffusion. It was trained by feeding short video clips to a motion model to learn how the next video frame should look like. Once this prior is learned, animateDiff injects the motion module to the noise predictor U-Net of a Stable Diffusion model to produce a video based on a text description.

animaediff model training and inference.
AnimateDiff pipeline (Image from the AnimateDiff paper.)

So, you can’t really control what happens in the video. They come from what the model learned from the training data.

You can use AnimateDiff with any Stable Diffusion checkpoint model and LoRA.

Installing the AnimateDiff extension

Google Colab Notebook

Installing the AnimateDiff extension on our Stable Diffusion Colab notebook is easy. All you need to do is to select the AnimateDiff extension,

Windows or Mac

Follow these steps to install the AnimateDiff extension in AUTOMATIC1111.

  1. Start AUTOMATIC1111 Web-UI normally.

2. Navigate to the Extension Page.

3. Click the Install from URL tab.

4. Enter the following URL in the URL for extension’s git repository field.

https://github.com/continue-revolution/sd-webui-animatediff

5. Wait for the confirmation message that the installation is complete.

6. Download the AnimateDiff motion models and save it in stable-diffusion-webui > extensions > sd-webui-animatediff > model folder.

Direct download link for v1.5 v2 motion model:

https://huggingface.co/guoyww/animatediff/resolve/main/mm_sd_v15_v2.ckpt

Direct download link for v1.4 motion model:

https://huggingface.co/guoyww/animatediff/resolve/main/mm_sd_v14.ckpt

Direct download link for v1.5 motion model:

https://huggingface.co/guoyww/animatediff/resolve/main/mm_sd_v15.ckpt

7. Restart Web-UI.

Using AnimateDiff

To use AnimateDiff in AUTOMATIC1111, navigate to the txt2img page. In the AnimateDiff section,

  • Enable AnimateDiff: Yes
  • Motion Module: There are two motion modules you can choose from. The v1.4 model creates more motion, but the v1.5 model creates clearer animations.

Then write a prompt and a negative prompt as usual. For example

1girl, looking at viewer, anime, cherry blossoms

disfigured, deformed, ugly

You can select any v1 model. Let’s use the Anything v3 model for this example.

CFG: 15 (It is VERY IMPORTANT to set it to a high value 10 – 25)

Sampler: DPM++2M Karass

Here are animations from the v1.4 and v1.5 motion models.

AnimateDiff v1.4 motion model (CFG 15)
AnimateDiff v1.5 model (CFG 20)

Generally, you will see the v1.4 motion model producing more motion and the v1.5 model producing clearer images.

Tips

Here are a few tips for you to increase your chance of success.

  • Increase the CFG scale if you get a gray-looking image. You typically need it to be between 10 and 25.
  • Try switching the motion module (v1.4 and v1.5) if you see watermarks on images.
  • You can use LoRA with AnimateDiff.
  • Keep the number of frames to 16 for optimal performance because this is the number it is trained.
  • You WILL need to cherry-pick to get good videos. Generate more!
  • Change the prompt if you get two short videos in one.

Example prompts and settings

Waves

Prompt:

A beautiful beach, waves

Negative prompt:

watermark, letters

CFG Scale: 20

Model: Dreamshaper 5

Motion module: v1.4

Girl in red sweater

Prompt:

1girl, tohsaka rin, solo, long hair, sweater, red sweater, looking at viewer, walking <lora:3DMM_V12:1>

Negative prompt:

disfigured, deformed, ugly

CFG Scale: 16

Model: Rev Animated

LoRA: 3D rendering style

Motion module: v1.4

Realistic woman

Prompt:

close up photo of young woman, highlight hair, sitting outside restaurant, wearing dress, rim lighting, studio lighting, looking at the camera, dslr, ultra quality, sharp focus, tack sharp, dof, film grain, Fujifilm XT3, crystal clear, 8K UHD, highly detailed glossy eyes, high detailed skin, skin pores

Negative prompt

disfigured, ugly, bad, immature, cartoon, anime, 3d, painting, b&w

CFG Scale: 15

Model: Realistic Vision

Motion module: v1.4

2. ModelScope

Video generation pipeline of Modelscope (from Modelscope paper.)

ModelScope is a diffusion-based text-to-video model. The idea behind the model is the observation that the frames of a video are mostly similar. ModeScope is a latent diffusion model. So the first frame starts as a latent noise tensor, the same as Stable Diffusion’s text-to-image. The innovation is that the model decomposes the noise into two parts: (1) the base noise and (2) the residual noise. The base noise is shared across ALL frames. The residual noise varies in each frame.

Use this ModelScope demo if you don’t want to install the extension in AUTOMATIC1111.

Installing ModelScope in AUTOMATIC1111

You will need to install the text2video extension. Follow these steps to install the extension in AUTOMATIC1111.

Google Colab

Installing the text2video extension on our Stable Diffusion Google Colab notebook is easy. All you need to do is to select the text2video extension.

Windows and Mac

  1. Start AUTOMATIC1111 Web-UI normally.

2. Navigate to the Extension Page.

3. Click the Install from URL tab.

4. Enter the following URL in the URL for extension’s git repository field.

https://github.com/kabachuha/sd-webui-text2video

5. Wait for the confirmation message that the installation is complete.

6. Restart webUI completely. (By restarting the terminal.)

NOTE: If you encounter an error related to importing tqdm, you will need to reinstall a specific version of tqdm. Run the following command in the Terminal App inside stable-diffusion-webui

Windows:

.venvScriptspip.exe install tqdm==4.65.0

Mac:

./venv/bin/pip install tqdm==4.65.0

7. Create the model folders. You need to create the folder structure.

stable-diffusion-webuimodelstext2videot2v

In other words, create the model text2video inside the models folder. Then create the folder t2v inside the text2video folder.

Download the text-to-video model files here and put them in the t2v folder. You will need the following 4 files

  • VQGAN_autoencoder.pth
  • configuration.json
  • open_clip_pytorch_model.bin
  • text2video_pytorch_model.pth

That’s a lot of work! But that’s it… for now.

Using ModelScope

To use ModelScope, navigate to the txt2video page in AUTOMATIC1111.

Model type: ModelScope

Model: t2v

Prompt: Write a prompt to describe the video. Not all prompts work. If you don’t see what you write, try other prompts. I used

a lion and a man in a suit fighting

Adjust Frames to control the length of the video. I used 100 in this example.

Press Generate and wait for it to complete. Click Update the video. This is what I got.

Modelscope prompt examples

Here’s one:

a female model

Here’s another one:

1girl, smiling, looking at viewer ,beautiful symmetrical face, nose, mouth, sharp focus, 8k

Negative prompt:

text, watermark, copyright, blurry, nsfw, frame, border, ugly, disfigured

Not the Hollywood-quality movie clips you expected? Yeah, this is what we can do for now. At least it produces temporal and spatial consistency across frames.

img2vid

There’s no documentation I can find for the img2vid function. But I found it quite useful for directing the initial frame and hence the whole video.

Modelscope tends to generate a face that is too close up. See the video above generated with the 1girl prompt.

Uploading a proper woman portrait in the Inpainting image canvas and setting the inpainting frames to 1. You get a much better-composed video of a woman smiling.

Fine-tuned models

Just like Stable Diffusion, there are fine-tuned models for Modelscope. They are created with additional training. You can find a list of fine-tuned models here.

Zeroscope: Watermark-free text-to-video

You may have noticed watermarks on some of the videos generated with the Modelscope base model. There’s a fine-tuned model to resolve this issue: Zeroscope v2.

First, let’s download the model.

1. Create the new folder stable-diffusion-webui > models > text2video > zeroscope_v2_576w.

2. You need 4 files in this folder. The first two can be found here. The other two can be downloaded from the Modelscope base model’s repository. (Or copy over from the base model’s folder)

Now let’s generate a video with the new model. This model generates videos with 576 x 320 pixels.

  • Model Type: ModelScope
  • Model: zeroscope_v2_576w
  • Prompt:

1girl, smiling, looking at viewer ,beautiful symmetrical face, nose, mouth, sharp focus, 8k

  • Negative prompt:

text, watermark, copyright, blurry, nsfw, frame, border, ugly, disfigured

  • Width: 576 (You have to use this width and height)
  • Height: 320
  • Frame: 30

Press Generate. Here’s what I got. Higher resolution and no more watermark on the video!

Upscaling the video with Zeroscope v2 XL

You can make the video larger by doing a video-to-video. It is kind of the same as image-to-image, except now you are transforming a video into another video.

Zeroscope v2 XL is an upscaler model to enlarge a video created by the Zeroscope v2 576 model.

But first, let’s install the model.

1. Create a new folder stable-diffusion-webui > models > text2video > zeroscope_v2_XL

2. You need 4 files in this folder. The first two can be found here. The other two can be downloaded from the Modelscope base model’s repository. (Or copy over from the base model’s folder)

To upscale a video with the Zeroscope v2 XL model, switch to the vid2vid tab.

Select the model: Zeroscope_v2_XL.

Upload the video you previously generated with Zeroscope v2 576w to the Input video canvas.

Enter a prompt and a negative prompt. You can reuse the same prompt.

Set width to 1024 and height to 576. (IMPORTANT: You must use these dimensions)

Set Frames to 30.

Set denoising strength to 0.7

Set vid2vid start frame to 1. (This helps to preserve the original video)

Press Generate.

Here’s the HD video:

Tips

  • Stay with 256×256 video size for the base model. Other sizes don’t work well. Similarly, stay with the default resolutions for the fine-tuned model.
  • As in prompting Stable Diffusion models, describe what you want to SEE in the video. But some subjects just don’t work. Don’t be too hang up and move on to other keywords.

3. Deforum

Deforum generates videos using Stable Diffusion models. It achieves video consistency through img2img across frames. Since the input are multiple text prompts, it qualifies as a text-to-video pipeline.

Read the Deforum tutorial.

Additional resources

A Dive into Text-to-Video Models – A good overview of the state of the art of text-to-video AI models.

ModelScope Text To Video Demo – Use ModelScope base model on the Web (Long wait time).

Text2Video-Zero Demo – Text2Video-Zero is another text2video AI.

pschaldenbrand/text2video – Text-to-video demo for “Towards Real-Time Text2Video via CLIP-Guided, Pixel-Level Optimization.”

Related Posts