caT text-to-video

A conditionally augmented text-to-video model. It uses pre-trained weights from the ModelScope text-to-video model, augmented with temporal conditioning transformers that extend generated clips and create smooth transitions between them. It also supports prompt interpolation, so the scene can change during clip extension.
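
In rough terms, clip extension works by feeding the last frames of the previous clip to the model as clean conditioning while the frames of the next clip are denoised. Below is a minimal sketch of that loop; `denoise_next_clip`, the embedding shape, and the stubbed tensors are hypothetical stand-ins for the repository's actual pipeline, not its API:

```python
import torch

def denoise_next_clip(cond_frames: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    # Hypothetical stand-in for the real diffusion sampler: it would start
    # from noise and denoise 8 new frames conditioned on the 8 clean
    # cond_frames and the prompt embedding. Here it just returns noise
    # so the sketch runs end to end.
    return torch.randn(8, 3, 320, 320)

# First clip: 8 frames generated from the prompt alone (stubbed as noise here).
clip = torch.randn(8, 3, 320, 320)
text_emb = torch.randn(1, 77, 1024)  # hypothetical prompt embedding

video = [clip]
for _ in range(3):                    # extend the video by three more clips
    cond_frames = video[-1][-8:]      # last 8 frames condition the next clip
    next_clip = denoise_next_clip(cond_frames, text_emb)
    video.append(next_clip)

full_video = torch.cat(video, dim=0)  # (32, 3, 320, 320)
```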

The model was trained on two RTX 6000 Ada GPUs for 5 million steps on the WebVid-10M dataset, with a batch size of 1 and a learning rate of 1e-6 at a resolution of 320x320. Each training sample used 8 frames for conditioning and 8 frames as noisy targets, sampled with a frame stride of 6.
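
With 8 conditioning frames, 8 noisy frames, and a stride of 6, one training sample covers (16 - 1) * 6 + 1 = 91 raw video frames, assuming the stride is the gap between consecutive sampled frames. A small sketch of that index arithmetic (the function name is mine, not the repository's):

```python
def sample_indices(start: int, stride: int = 6, n_cond: int = 8, n_noisy: int = 8):
    """Frame indices for one training sample: conditioning frames first,
    then the frames that receive noise during training."""
    idx = [start + i * stride for i in range(n_cond + n_noisy)]
    return idx[:n_cond], idx[n_cond:]

cond_idx, noisy_idx = sample_indices(start=0)
print(cond_idx)   # [0, 6, 12, 18, 24, 30, 36, 42]
print(noisy_idx)  # [48, 54, 60, 66, 72, 78, 84, 90]
```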

Installation

Clone the Repository

```
git clone https://github.com/motexture/caT-text-to-video-2.3b/
cd caT-text-to-video-2.3b
```

Create a Virtual Environment and Install Dependencies

```
python3 -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
pip install -r requirements.txt
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```

Run the Interface

```
python run.py
```

Visit the provided URL in your browser to interact with the interface and start generating videos.

Examples:

A guy is riding a bike -> A guy is riding a motorcycle

Will Smith is eating a hamburger -> Will Smith is eating an ice cream

A lion is looking around -> A lion is running

Darth Vader is surfing on the ocean

A beautiful anime girl with pink hair -> Anime girl laughing
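
Each `A -> B` pair above is a prompt interpolation: across successive clip extensions, the text conditioning moves from the first prompt's embedding toward the second's. A minimal sketch of one plausible blending scheme (plain linear interpolation, with stubbed embeddings; the repository's actual method may differ):

```python
import torch

def lerp_prompts(emb_a: torch.Tensor, emb_b: torch.Tensor, steps: int):
    """Yield embeddings blended from prompt A to prompt B across extensions."""
    for i in range(steps):
        t = i / (steps - 1)
        yield (1 - t) * emb_a + t * emb_b

emb_a = torch.randn(1, 77, 1024)  # "A guy is riding a bike" (stubbed)
emb_b = torch.randn(1, 77, 1024)  # "A guy is riding a motorcycle" (stubbed)

for step, emb in enumerate(lerp_prompts(emb_a, emb_b, steps=4)):
    # Each blended embedding would condition one clip extension,
    # moving the scene from prompt A toward prompt B.
    print(step, emb.shape)
```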
