AnimateDiff: Transform Text into Videos

Video generation with Stable Diffusion is improving at unprecedented speed. In this post, you will learn how to use AnimateDiff, a video generation technique described in the paper AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning by Yuwei Guo and coworkers.

AnimateDiff is one of the easiest ways to generate videos with Stable Diffusion. In the most basic form, you only need to write a prompt, pick a model, and turn on AnimateDiff!

This is what AnimateDiff videos look like.

Trying out the new armor for the first time.
AnimateDiff with Prompt Travel.

We will cover:

  1. How AnimateDiff works
  2. Installing the AnimateDiff extension
  3. Step-by-step guide to generating a video
  4. Advanced settings
  5. Using Motion LoRA to enhance movement
  6. Stylizing a video with ControlNet and AnimateDiff
  7. Using AnimateDiff with image-to-image
  8. AnimateDiff prompt travel
  9. Increasing resolution with Hires fix
  10. Speeding up AnimateDiff

What is AnimateDiff?

AnimateDiff turns a text prompt into a video using a Stable Diffusion model. You can think of it as a slight generalization of text-to-image: Instead of generating an image, it generates a video.

How does AnimateDiff work?

AnimateDiff adds a motion module to a Stable Diffusion model. The module is trained on a large collection of short video clips. During generation, it conditions the sampling process so that the resulting series of frames moves like the clips it was trained on.

AnimateDiff pipeline – training and inference. (Image from AnimateDiff paper)

Like ControlNet, the AnimateDiff motion module can be used with ANY Stable Diffusion checkpoint of the matching base version. The standard motion modules support Stable Diffusion v1.5 models; a separate SDXL motion module (covered later in this article) supports SDXL models.
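If you prefer scripting to a GUI, the same plug-in design can be sketched with Hugging Face’s diffusers library. The snippet below is a minimal sketch under assumptions: the repository names are mine to illustrate, and any SD v1.5 checkpoint can be substituted.

```python
# A minimal sketch of the plug-in design using Hugging Face diffusers.
# The repository names are assumptions; swap in any SD v1.5 checkpoint you like.
import torch
from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
from diffusers.utils import export_to_gif

# Motion module trained on short video clips
adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16
)

# Any Stable Diffusion v1.5 checkpoint; the motion module plugs into it
pipe = AnimateDiffPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    motion_adapter=adapter,
    torch_dtype=torch.float16,
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(
    pipe.scheduler.config, beta_schedule="linear", clip_sample=False
)

frames = pipe(
    prompt="a cybergirl in futuristic silver armor, living room, smiling",
    num_inference_steps=20,
    guidance_scale=7.5,
    num_frames=16,  # the v2 motion module is trained on 16-frame clips
).frames[0]
export_to_gif(frames, "animatediff.gif")
```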

Limitation of AnimateDiff

Since it reproduces motion learned from its training data, AnimateDiff tends to produce generic, commonly seen movements. It won’t produce a video that follows a detailed sequence of motions described in the prompt.

The quality of motion is also sensitive to the training data: it can’t animate exotic subjects or styles that are not present in the data it was trained on. Keep this in mind when you choose what to animate. Not all subjects and styles are equal.

There are tricks to improve the motion:

  1. Change the prompt during video generation. This technique is called prompt travel.
  2. Use a reference video with ControlNet.

You will learn both techniques in this article.

Software setup

We will use AUTOMATIC1111 Stable Diffusion WebUI, a popular and free open-source software. You can use this GUI on Windows, Mac, or Google Colab.

Check out the Quick Start Guide if you are new to Stable Diffusion. Check out the AUTOMATIC1111 Guide if you are new to AUTOMATIC1111.

Installing AnimateDiff extension

We will use the AnimateDiff extension for Stable Diffusion WebUI.

Google Colab

Installing AnimateDiff in the Colab Notebook in the Quick Start Guide is easy. All you need to do is check the AnimateDiff option in the Extensions section.

Windows or Mac

To install the AnimateDiff extension in AUTOMATIC1111 Stable Diffusion WebUI:

  1. Start AUTOMATIC1111 Web-UI normally.
  2. Navigate to the Extension Page.
  3. Click the Install from URL tab.
  4. Enter the extension’s URL in the URL for extension’s git repository field.

     https://github.com/continue-revolution/sd-webui-animatediff

  5. Wait for the confirmation message that the installation is complete.
  6. Restart AUTOMATIC1111.

Downloading motion modules

(You don’t need to do this step if you are using our Colab notebook.)

You need to download at least one motion module before using AnimateDiff. They can be found on the original authors’ Hugging Face page.

If you only want the latest versions of the motion modules, get mm_sd_v15_v2.ckpt (and, optionally, the v3 module mm_sd15_v3.safetensors covered later in this article).

Older versions are also available on the same Hugging Face page.

Put the motion module files in the folder stable-diffusion-webui > extensions > sd-webui-animatediff > model.
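If you prefer to script the download, something like the snippet below should work with the huggingface_hub client. Treat the repo id and filename as assumptions and verify them on the authors’ Hugging Face page.

```python
# Hedged helper: download a motion module into the extension's model folder.
# The repo id and filename are assumptions; verify them on the authors' Hugging Face page.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="guoyww/animatediff",
    filename="mm_sd_v15_v2.ckpt",
    local_dir="stable-diffusion-webui/extensions/sd-webui-animatediff/model",
)
```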

Generating a video with AnimateDiff

Let’s generate a video of a happy girl trying out her new armor in the living room.

New shiny armor, yeah!

Step 1: Select a Stable Diffusion model

I’m going with a realistic character in this example. Let’s use CyberRealistic v3.3. Download the model and put it in stable-diffusion-webui > models > Stable-Diffusion.

In the Stable Diffusion checkpoint dropdown menu, select cyberrealistic_v33.safetensors.

Step 2: Enter txt2img settings

On the txt2img page, enter the following settings.

  • Prompt

((best quality)), ((masterpiece)), ((realistic)), long highlighted hair, cybergirl, futuristic silver armor suit, confident stance, high-resolution, living room, smiling, head tilted

  • Negative Prompt:

CyberRealistic_Negative-neg

Note: CyberRealistic_Negative is a negative embedding (guide to install).

  • Steps: 20
  • Sampler: DPM++ 2M Karras
  • CFG scale: 10
  • Seed: -1
  • Size: 512×512
txt2img settings.

Adjust the batch count to generate multiple videos in one go.

Step 3: Enter AnimateDiff settings

On the txt2img page, scroll down to the AnimateDiff section.

Enter the following settings.

  • Motion Module: mm_sd_v15_v2.ckpt
  • Enable AnimateDiff: Yes
  • Number of frames: 32 (This is the length of the video in frames.)
  • FPS: 8 (This is frames per second, so the video length is 32 frames / 8 fps = 4 seconds.)

You can leave the rest as default.

Select MP4 in the Save options if you want to save an MP4 video.

Step 4: Generate the video

Press Generate to create a video. You should get something similar to this.
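For reference, the settings above map fairly directly onto diffusers pipeline arguments if you ever want to reproduce the generation outside the WebUI. This is a sketch under assumptions: the hosted SD 1.5 checkpoint stands in for CyberRealistic, and plain diffusers does not replicate the extension’s 16-frame context batching for 32-frame videos.

```python
# Rough diffusers equivalent of the txt2img + AnimateDiff settings above (a sketch).
import torch
from diffusers import AnimateDiffPipeline, MotionAdapter
from diffusers.utils import export_to_video

adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16
)
pipe = AnimateDiffPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", motion_adapter=adapter, torch_dtype=torch.float16
).to("cuda")

prompt = (
    "((best quality)), ((masterpiece)), ((realistic)), long highlighted hair, "
    "cybergirl, futuristic silver armor suit, confident stance, high-resolution, "
    "living room, smiling, head tilted"
)
output = pipe(
    prompt=prompt,
    negative_prompt="low quality, worst quality",  # stand-in for the negative embedding
    num_inference_steps=20,                        # Steps: 20
    guidance_scale=10.0,                           # CFG scale: 10
    width=512,
    height=512,                                    # Size: 512x512
    num_frames=16,                                 # the extension covers 32 frames via 16-frame contexts
    generator=torch.Generator("cuda").manual_seed(42),  # Seed: -1 in the GUI means random
)
export_to_video(output.frames[0], "armor.mp4", fps=8)    # FPS: 8
```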

Advanced options

You can find detailed explanations of parameters on the GitHub page. Below are some explanations with illustrated examples.

Close loop

The close loop option makes the video loop seamlessly: the first frame is the same as the last frame, so you won’t see a sudden jump when the video wraps around from the last frame back to the first.

  • N: Close loop is not used.
  • R-P: Reduce the number of closed-loop contexts. The prompt travel will NOT be interpolated to be a closed loop.
  • R+P: Reduce the number of closed-loop contexts. The prompt travel WILL BE interpolated to be a closed loop.
  • A: Make the last frame the same as the first frame. The prompt travel WILL BE interpolated to be a closed loop.
Close loop R-P.
Close loop A.

Frame interpolation

Frame interpolation makes the video look smoother by increasing the number of frames per second.

Set Frame Interpolation to FILM and Interp X to the FPS multiplier. For example, setting Interp X to 5 turns an 8 FPS video into a 40 FPS video.

Note that you also need to set the FPS field to the interpolated rate (8 × 5 = 40 FPS) for the animated GIF to play at the correct speed. Otherwise, it plays in slow motion.

Frame interpolation (5x) makes the video look smoother.

Context batch size

Context batch size controls temporal consistency. It is the number of frames the motion module processes together: a higher value makes the video change less between frames, while a smaller value makes it change more.

However, quality seems to degrade when it is set to anything other than 16, so it is best to keep it at 16.

Video-to-video with AnimateDiff

You can direct the motion with a reference video using ControlNet.

Let’s use this reference video as an example. The goal is to have AnimateDiff follow the girl’s motion in the video.

Step 1: Upload video

On the txt2img page, scroll down to the AnimateDiff section.

Upload the video to the Video source canvas.

Step 2: Enter AnimateDiff settings

The Number of frames and FPS parameters should match the source video and should have been populated automatically:

  • Number of frames: 96
  • FPS: 29

Don’t forget to enable AnimateDiff.

  • Enable AnimateDiff: Yes

Step 3: Enter txt2img settings

The txt2img parameters are:

  • Checkpoint model: cyberrealistic_v33.safetensors
  • Prompt

((best quality)), ((masterpiece)), ((realistic)), long highlighted hair, cybergirl, futuristic silver armor suit, confident stance, high-resolution, living room, smiling, head tilted

  • Negative Prompt:

CyberRealistic_Negative-neg

  • Steps: 20
  • Sampler: DPM++ 2M Karras
  • CFG scale: 10
  • Seed: -1
  • Size: 512×512

Step 4: Turn on ControlNet

You must enable ControlNet to copy the motion from the reference video.

Let’s use DW Openpose. In the ControlNet Unit 0 section, set:

  • Enable: Yes
  • Preprocessor: dw_openpose_full
  • Model: Openpose

Step 5: Generate video

Press Generate.

(The AnimateDiff extension is finicky. If it errors out, try to press Generate again. If it still doesn’t work, restart A1111 completely and try again.)

Openpose

Here’s the AnimateDiff video with Openpose.

You can see the pose transfers well, but the other objects and the background keep changing.

Canny

You can also use Canny edge detection instead of OpenPose.

  • Enable: Yes
  • Preprocessor: Canny
  • Model: Canny
  • Control weight: 1

HED and OpenPose

Let’s use TWO ControlNet units to fix the pose AND the lines.

ControlNet 0:

  • Enable: Yes
  • Preprocessor: Softedge_HED
  • Model: control_v11p_sd15_softedge
  • Control weight: 0.5

ControlNet 1:

  • Enable: Yes
  • Preprocessor: dw_openpose_full
  • Model: control_v11p_sd15_openpose
  • Control weight: 0.5

Motion LoRA

You can use motion LoRAs to add camera movement to the video. They are used in the same way as standard LoRAs.

Installing motion LoRA

You can download the motion LoRAs at the following link.

Motion LoRA download page

Download all the files with lora in the filename.

Put them in stable-diffusion-webui > models > Lora.

Using motion LoRA

All you need to do is add the motion LoRA to the prompt. For example:

Prompt:

((best quality)), ((masterpiece)), ((realistic)), long highlighted hair, cybergirl, futuristic silver armor suit, confident stance, high-resolution, living room, smiling, head tilted <lora:v2_lora_PanLeft:1>

Negative prompt:

CyberRealistic_Negative-neg

You see the background is moving to the right, suggesting the camera is panning to the left.

But using LoRA weight 1 seems to create an artifact in the background. Reducing the LoRA weight to 0.75 produces better results.

((best quality)), ((masterpiece)), ((realistic)), long highlighted hair, cybergirl, futuristic silver armor suit, confident stance, high-resolution, living room, smiling, head tilted <lora:v2_lora_PanLeft:0.75>

Pan left

Zoom in:

Uh..?

Zoom out:

Ah…?
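If you work in diffusers instead of the WebUI, motion LoRAs load like ordinary LoRAs. Below is a hedged sketch mirroring the pan-left example above; the repository ids are my assumptions, so check the motion LoRA download page for the actual files.

```python
# Sketch: attach a pan-left motion LoRA in diffusers. Repo ids are assumptions.
import torch
from diffusers import AnimateDiffPipeline, MotionAdapter

adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16
)
pipe = AnimateDiffPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", motion_adapter=adapter, torch_dtype=torch.float16
).to("cuda")

pipe.load_lora_weights(
    "guoyww/animatediff-motion-lora-pan-left", adapter_name="pan-left"
)
pipe.set_adapters(["pan-left"], adapter_weights=[0.75])  # weight 1.0 tended to add artifacts

frames = pipe(
    prompt="cybergirl in futuristic silver armor, living room, smiling",
    num_frames=16,
    guidance_scale=7.5,
).frames[0]
```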

Image-to-image

You can direct the composition and motion to a limited extent by using AnimateDiff with img2img. With this method, you define the initial and final images of the video. They won’t be reproduced exactly, because they still go through the image-to-image process.

Navigate to the img2img page in AUTOMATIC1111.

Enter the img2img settings.

  • Prompt:

((best quality)), ((masterpiece)), ((realistic)), long highlighted hair, cybergirl, futuristic silver armor suit, confident stance, high-resolution, living room, smiling, head tilted <lora:v2_lora_ZoomOut:1>

  • Negative prompt:

CyberRealistic_Negative-neg

Upload the initial image to the image canvas of the img2img tab.

  • Steps: 20
  • Sampler: DPM++ 2M Karras
  • CFG scale: 7
  • Seed: -1
  • Size: 512×512
  • Denoising strength: 0.75
  • Motion Module: mm_sd_v15_v2.ckpt
  • Enable AnimateDiff: Yes
  • Number of frames: 32
  • FPS: 8

You can leave the rest as default.

Upload an image to the optional last frame canvas.

Press Generate.

AnimateDiff Prompt travel

Do you feel the motion of AnimateDiff is a bit lacking? You can increase the motion by specifying different prompts at different time points. This feature is generally known as prompt travel in the Stable Diffusion community.

This is how prompt travel works. Say you specify prompt 1 at frame 1 and prompt 2 at frame 10. Frames 1 and 10 use prompt 1 and prompt 2 exactly, and the frames in between use an interpolation of the two prompts.
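Conceptually, this is just interpolation between the text conditionings of neighbouring keyframe prompts. The snippet below is a minimal illustration of that idea, not the sd-webui-animatediff implementation.

```python
# Conceptual sketch of prompt travel: text embeddings of neighbouring keyframe
# prompts are linearly blended for the frames in between. Illustration only.
import torch

def travel_embeddings(keyframes: dict[int, torch.Tensor], num_frames: int) -> list[torch.Tensor]:
    """keyframes maps a frame index to that prompt's text embedding."""
    idxs = sorted(keyframes)
    out = []
    for f in range(num_frames):
        if f <= idxs[0]:                      # before the first keyframe: hold it
            out.append(keyframes[idxs[0]])
        elif f >= idxs[-1]:                   # after the last keyframe: hold it
            out.append(keyframes[idxs[-1]])
        else:                                 # in between: blend the two neighbours
            lo = max(i for i in idxs if i <= f)
            hi = min(i for i in idxs if i > f)
            t = (f - lo) / (hi - lo)
            out.append(torch.lerp(keyframes[lo], keyframes[hi], t))
    return out

# e.g. one prompt keyed at frame 0 and another at frame 8 (dummy embeddings here)
embeds = travel_embeddings({0: torch.randn(77, 768), 8: torch.randn(77, 768)}, num_frames=16)
```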

Without prompt travel

Prompt:

(masterpiece, top quality, best quality, official art, beautiful and aesthetic:1.2), (1girl), extreme detailed,(fractal art:1.3),colorful,highest detailed

Negative prompt:

(worst quality:2), (low quality:2), (normal quality:2), lowres, bad anatomy, normal quality, ((monochrome)), easynegative, badhandv4

With prompt travel

To use prompt travel, write the prompt in the following format:

(masterpiece, top quality, best quality, official art, beautiful and aesthetic:1.2), (1girl), extreme detailed,(fractal art:1.3),colorful,highest detailed
0: smile
8: (arm over head:1.2)
studio lighting

The first line is the prompt prefix and the last line is the prompt suffix. They are added to the beginning and end of each frame’s prompt, respectively.

In the middle, we specify the prompts at different frames.

Here’s what you get:

Increasing resolution with Hires fix

You can use AnimateDiff with Hires fix to increase the resolution.

  • Upscaler: 4x-UltraSharp
  • Hires steps: 10
  • Denoising strength: 0.6
  • Upscale by: 1.4

AnimateDiff v3

AnimateDiff v3 is not a new version of the AnimateDiff method but an updated motion module. To use it, download the motion module and put it in the stable-diffusion-webui > models > animatediff folder. You can download the v3 motion module for AUTOMATIC1111 here.

You can use the AnimateDiff v3 motion module the same way as v2.

Model: Hello Young

Prompt:

(masterpiece, top quality, best quality, official art, beautiful and aesthetic:1.2), (1girl), extreme detailed,(fractal art:1.3),colorful,highest detailed

Negative prompt:

(worst quality:2), (low quality:2), (normal quality:2), lowres, bad anatomy, normal quality, ((monochrome)), easynegative, badhandv4

Here is an AnimateDiff video generated with the v3 motion module.

Below is an animation with the same settings but using the v2 model for comparison.

In my testing, I cannot say v3 is better than v2. They generate different motions. You can keep both in your toolbox and see which one works better in your particular workflow.
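If you script with diffusers, switching to v3 is just a matter of loading a different motion adapter. A minimal sketch (the repository id is my assumption):

```python
# Sketch: in diffusers, using v3 just means loading a different motion adapter.
# The repo id is an assumption.
import torch
from diffusers import AnimateDiffPipeline, MotionAdapter

adapter_v3 = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-3", torch_dtype=torch.float16
)
pipe = AnimateDiffPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", motion_adapter=adapter_v3, torch_dtype=torch.float16
).to("cuda")
```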

AnimateDiff for SDXL

AnimateDiff SDXL is not a new version of AnimateDiff, but a motion module that is compatible with the Stable Diffusion XL model. You need to download the SDXL motion module and put it in the stable-diffusion-webui > models > animatediff folder.

You can use the AnimateDiff SDXL motion module the same way as the other motion modules. Remember to set the image size to one compatible with the SDXL model, e.g. 1024 x 1024.

You can use any SDXL model, not just the base model.

Below is an example of AnimateDiff SDXL.

Checkpoint Model: dreamshaperXL10_alpha2Xl10

Prompt:

In Casey Baugh’s evocative style, art of a beautiful young girl cyborg with long brown hair, futuristic, scifi, intricate, elegant, highly detailed, majestic, Baugh’s brushwork infuses the painting with a unique combination of realism and abstraction, greg rutkowski, surreal gold filigree, broken glass, (masterpiece, sidelighting, finely detailed beautiful eyes: 1.2), hdr, realistic painting, natural skin, textured skin, closed mouth, crystal eyes, butterfly filigree, chest armor, eye makeup, robot joints, long hair moved by the wind, window facing to another world, Baugh’s distinctive style captures the essence of the girl’s enigmatic nature, inviting viewers to explore the depths of her soul, award winning art

Negative prompt:

ugly, deformed, noisy, blurry, low contrast, text, BadDream, 3d, cgi, render, fake, anime, open mouth, big forehead, long neck

Image size: 1024 x 1024

Motion module: mm_sdxl_v10_beta.safetensors
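If your diffusers version ships the (experimental) AnimateDiffSDXLPipeline, a rough equivalent of this setup looks like the sketch below; the repository ids are my assumptions.

```python
# Sketch: AnimateDiff with an SDXL checkpoint via diffusers' AnimateDiffSDXLPipeline
# (experimental at the time of writing). Repo ids are assumptions.
import torch
from diffusers import AnimateDiffSDXLPipeline, MotionAdapter

adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-sdxl-beta", torch_dtype=torch.float16
)
pipe = AnimateDiffSDXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # any SDXL checkpoint should work
    motion_adapter=adapter,
    torch_dtype=torch.float16,
).to("cuda")

frames = pipe(
    prompt="a beautiful young girl cyborg with long brown hair, futuristic, highly detailed",
    width=1024,
    height=1024,        # SDXL-native resolution
    num_frames=16,
    guidance_scale=8.0,
).frames[0]
```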

Speeding up AnimateDiff

Video generation can be slow. AnimateDiff is no exception. Here are a few ways you can speed up video generation with AnimateDiff.

LCM LoRA

LCM LoRA is a LoRA model for speeding up Stable Diffusion. You can expect the video generation to be 3 times faster.

Follow the LCM LoRA tutorial to install the LCM LoRA modules. There are SD 1.5 and SDXL versions available.

The image settings for LCM LoRA are quite different, so it is important to nail them down without AnimateDiff first.

Model: Hello Young

Sampling method: LCM

Sampling steps: 7

CFG Scale: 2

Prompt:

(masterpiece, top quality, best quality, official art, beautiful and aesthetic:1.2), (1girl), extreme detailed,(fractal art:1.3),colorful,highest detailed <lora:lcm_lora_sd15:1>

Negative prompt:

(worst quality:2), (low quality:2), (normal quality:2), lowres, bad anatomy, normal quality, ((monochrome)), easynegative, badhandv4

With AnimateDiff off, it generates a good image.

Now, turn AnimateDiff on.

Motion Module: mm_sd15_v3.safetensors

Enable AnimateDiff: Yes
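The same speed-up can be sketched in diffusers by swapping in the LCM scheduler and loading the LCM LoRA. The repository ids and exact settings below are my assumptions; use the LCM LoRA you actually installed.

```python
# Sketch: LCM LoRA speed-up in diffusers -- few steps, low CFG. Repo ids are assumptions.
import torch
from diffusers import AnimateDiffPipeline, LCMScheduler, MotionAdapter
from diffusers.utils import export_to_gif

adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-3", torch_dtype=torch.float16
)
pipe = AnimateDiffPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", motion_adapter=adapter, torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)  # LCM sampling
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5", adapter_name="lcm")
pipe.set_adapters(["lcm"], adapter_weights=[1.0])

frames = pipe(
    prompt="(1girl), fractal art, colorful, highly detailed",
    num_inference_steps=7,  # LCM needs far fewer steps
    guidance_scale=2.0,     # and a low CFG scale
    num_frames=16,
).frames[0]
export_to_gif(frames, "lcm_animatediff.gif")
```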

SDXL Turbo

SDXL Turbo models have the same architecture as other SDXL models, but the Turbo training method lets them use fewer sampling steps. You can expect the video generation to be 3 times faster.

Checkpoint Model: dreamshaperXL10_alpha2Xl10

It is important to use the following sampling method, steps, and CFG scale; otherwise, the quality will be poor.

Sampling method: DPM++ SDE Karras

Sampling steps: 7

CFG Scale: 2

Prompt:

In Casey Baugh’s evocative style, art of a beautiful young girl cyborg with long brown hair, futuristic, scifi, intricate, elegant, highly detailed, majestic, Baugh’s brushwork infuses the painting with a unique combination of realism and abstraction, greg rutkowski, surreal gold filigree, broken glass, (masterpiece, sidelighting, finely detailed beautiful eyes: 1.2), hdr, realistic painting, natural skin, textured skin, closed mouth, crystal eyes, butterfly filigree, chest armor, eye makeup, robot joints, long hair moved by the wind, window facing to another world, Baugh’s distinctive style captures the essence of the girl’s enigmatic nature, inviting viewers to explore the depths of her soul, award winning art

Negative prompt:

ugly, deformed, noisy, blurry, low contrast, text, BadDream, 3d, cgi, render, fake, anime, open mouth, big forehead, long neck

Image size: 1024 x 1024

Motion module: mm_sdxl_v10_beta.safetensors

Troubleshooting

AnimateDiff produces 2 distinct videos instead of one

The prompt may be too long. In AUTOMATIC1111 > Settings > Optimization, check Pad prompt/negative prompt to be same length.
