Efficient Video Diffusion with img2vid: A Comprehensive Guide

Stable Video Diffusion is the first Stable Diffusion model designed to generate video. You can use it to animate images generated by Stable Diffusion, creating stunning visual effects.

Here are a few sample videos.

From the Realistic Egyptian Princess workflow.

From the Biomechanical animal workflow:

From the Castle in Fall workflow:

In this article, you will learn about

  • What Stable Video Diffusion is.
  • How to use it on Google Colab online.
  • How to use the txt-to-video workflow in ComfyUI.
  • How to install and use it locally on Windows.

What is Stable Video Diffusion

Stable Video Diffusion (SVD) is the first foundational video model released by Stability AI, the creator of Stable Diffusion. It is an open-source model, with code and model weights freely available.

What it does

SVD is an image-to-video (img2vid) model. You supply the first frame, and the model will generate a short video clip. Below is an example of the input and output of the model.

Image input to the SVD model.
Video output of the SVD model.

Model and training

The model and training are described in the article Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Dataset (2023) by Andreas Blattmann and coworkers.

The SVD model has gone through 3 stages of training.

  1. Train an image model.
  2. Extend the image model to a video model, which is then pretrained with a large dataset of videos.
  3. Finetune the video model with a smaller dataset of high-quality videos.

The curation and improvement of the dataset are key to the success of the video model.

The image model is Stable Diffusion 2.1, the forgotten predecessor of the SDXL model. The pretrained image model forms the image backbone of the video model.

Temporal convolution and attention layers are added to the U-Net noise estimator to create the video model. Now, the latent tensor represents a video instead of an image. All frames are denoised with reverse diffusion at the same time. This temporal diffusion model is the same as the VideoLDM model.

All latent frames are diffused together (from the VideoLDM paper.)

The video model has 1.5B parameters and is trained with a large video dataset. Finally, the video model is fine-tuned with a smaller but higher-quality dataset.

Stable Stable Video Models weights

Two SVD model weights are publicly available.

  • SVD – trained to generate 14 frames at resolution 576×1024.
  • SVD XT – trained to generate 25 frames at resolution 576×1024.

We will focus on using the SVD XT model in this article.

Model parameters

Below is a list of important parameters that control the video output.

Motion bucket id

The motion bucket id controls how much motion is in the video. A higher value means more motion. Accepts a value between 0 and 255.

FPS

The frames per second (fps) parameter controls the number of frames the model generates. Stay between 5 and 30 for optimal performance.

Augmentation level

The augmentation level is the amount of noise added to the initial image. Use it to change the initial image more or when generating videos that deviate from the default size.

Use Stable Video Diffusion on Colab

You need a high VRAM NVidia GPU card to run Stable Video Diffusion locally. If you don’t have one, the best option is Google Colab online. The notebook works with the free account.

Step 1: Open the Colab Notebook

Go to the GitHub page of the Colab notebook. Give me a star (Okay, this is optional…). Click the Open in Colab icon to open the notebook.

Here’s the direct link to the notebook.

Step 2: Review the notebook option

The default setting is good to go. But you can optionally not save the final video in your Google Drive.

Step 3: Run the notebook

Click the run button to start running the notebook.

Step 4: Start the GUI

After it is done loading, you should see a gradio.live link. Click the link to start the GUI.

Step 5: Upload an initial image

Drop an image you wish to use as the first frame of the video.

Adjust the crop offset to adjust the position of the crop.

Step 6: Start video generation

Click Run to start the video generation. The video will appear on the GUI when it is done.

It takes about 9 minutes on a T4 GPU (free account) and 2 minutes on a V100 GPU.

Customize your video

You can increase the Motion Bucket ID parameter in the advanced settings to increase the motion in the video.

Use a fixed integer for the seed parameter to generate the same video.

Use Stable Video Diffusion with ComfyUI

ComfyUI now supports the Stable Video Diffusion SVD models. Follow the steps below to install and use the text-to-video (txt2vid) workflow. It generates the initial image using the Stable Diffusion XL model and a video clip using the SVD XT model.

Read the ComfyUI installation guide and ComfyUI beginner’s guide if you are new to ComfyUI.

Step 1: Load the text-to-video workflow

Download the ComfyUI workflow below.

Drag and drop it to ComfyUI.

Step 2: Update ComfyUI

Update ComfyUI, install missing custom nodes, and update all custom nodes. Using the ComfyUI manager will make this step easier.

Restart ComfyUI completely and load the text-to-video workflow again. ComfyUI should have no complaints if everything is updated correctly.

Step 3: Download models

Download the SVD XT model. Put it in the ComfyUI > models > checkpoints folder.

Refresh the ComfyUI page and select the SVD_XT model in the Image Only Checkpoint Loader node.

The workflow uses the SDXL 1.0 model. Download the model if you have not already. Put it in the ComfyUI > models > checkpoints folder.

Refresh the ComfyUI page and select the SDXL model in the Load Checkpoint node.

Step 4: Run the workflow

Click Queue Prompt to run the workflow. A video should be generated.

Parameters

video_frame: Number of frames. Keep it at 25 since this is what the model is trained.

motion_bucket_id: Controls how much motion is in the video. A higher value means more motion.

fps: Frames per second.

Augmentation_level: The amount of noise added to the initial image. The higher it is, the more different the video is from the initial frame. Increase it when you use a video size different from the default.

min_cfg: Sets the CFG scale at the beginning of the video. The CFG scale changes linearly to the cfg value defined in the KSampler node at the end of the video. In this example, min_cfg is set to 1.0, and cfg is set to 2.5. The CFG scale is 1.0 for the first frame, 2.5 for the last frame, and varies linearly in between. The more further away from the first frame, the higher CFG scale it gets.

Install Stable Video Diffusion on Windows

You can run Stable Video Difusion locally if you have a high-RAM GPU card. The following installation process is tested with a 24GB RTX4090 card.

It is difficult to install this software locally. You may encounter issues not described in this section. So proceed only if you are tech-savvy, or want to be…

You will need git and Python 3.10 to install and use the software. See the installation guide for Stable Diffusion for steps to install them.

Step 1: Clone the repository

Open the PowerShell App. DON’T use the Command Prompt (cmd). It won’t work with these instructions.

To Open the PowerShell App, press the Windows key and search for “PowerShell”. Click the Windows PowerShell App to start.

Before you start, confirm you have Python 3.10 by running the following command.

python --version

You are good to proceed if it says “Python 3.10.x”.

You can change the directory to the one in which you want to install the software.

git clone https://github.com/Stability-AI/generative-models

Step 2: Create a virtual environment

Go into the newly cloned folder.

cd generative-models

Create a virtual environment.

python -m venv venv

You should see a folder called venv created.
Activate the virtual environment.

.venvScriptsActivate.ps1

If this command is successful, you should see (venv) in front of your command prompt. This indicates you are now in the virtual environment.

You must be in the virtual environment when installing or running the software.

If you don’t see the (venv) label in a later step, run the activate.ps1 script to enter the virtual environment.

Step 3: Remove the triton package in requirements

In the File Explorer App, navigate to the folder generative-models > requirements.

Open the requirement file pt2.txt with the Notepad App.

Remove the line “triton==2.0.0”. This is not strictly needed and will cause errors in Windows.

Save and close the file.

Step 4: Install the required libraries

Go back to the PowerShell App. Make sure you still see the (venv) label.

Run the following command to install PyTorch.

 pip3 install torch==2.0.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Run the following command to install the required libraries.

pip3 install -r .requirementspt2.txt

Run the following command to install the generative model software.

pip3 install .

Run the following command to install a required library.

pip3 install -e git+https://github.com/Stability-AI/datapipelines.git@main#egg=sdata

Step 5: Download the video model

In the File Explorer App, navigate to the generative-models folder and create a folder called “checkpoints”.

Navigate to the folder generative-models > checkpoints.

Download the safetensors model (svd_xt.safetensors) and put it in the checkpoints model directory.

Step 6: Run the GUI

Go back to the PowerShell App. You should be in the generative-models folder and in the virtual environment.

Run the following command to set the Python path.

$ENV:PYTHONPATH=$PWD

Run the following command to start the GUI.

streamlit run scripts/demo/video_sampling.py

A new webpage should be opened. If it didn’t, see the printout of the PowerApp terminal. Go to the Local URL. It should be something like:

http://localhost:8501

Step 7: Generate a video

In the Model Version dropdown menu, select svd_xt.

Click the Load Model checkbox.

Watch the PowerShell terminal for errors.

It may show an error message in the GUI. But it is okay as long as the new Input section appears.

Drop an image as the initial frame to the Input box.

Scroll down and find the Decode t frames at a time field. Set it to 1.

Click Sample to start the video generation.

Watch the PowerShell terminal for progress.

When it is done, the video will show up on the GUI.

Close the PowerShell App when you are done.

Starting the GUI again

To start the GUI again, open the PowerShell App.

Navigate to the generative-models folder.

cd generative-models

Activate the virtual environment.

.venvScriptsActivate.ps1

Run the following command to set the Python path.

$ENV:PYTHONPATH=$PWD

Run the following command to start the GUI.

streamlit run scripts/demo/video_sampling.py

Resources

Stable Video Diffusion Colab notebook

Introducing Stable Video Diffusion – Official press release of SVD.

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets – The research paper.

Stability-AI/generative-models: Generative Models by Stability AI – code on GitHub page.

stabilityai/stable-video-diffusion-img2vid-xt – Model weights on Hugging Face.

aizmin: