|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
- zh |
|
pipeline_tag: text-to-video |
|
library_name: diffusers |
|
tags: |
|
- video |
|
- video-generation |
|
--- |
|
|
|
# Wan-Fun |
|
|
|
😊 Welcome! |
|
|
|
[🤗 Hugging Face Space](https://huggingface.co/spaces/alibaba-pai/Wan2.1-Fun-1.3B-InP)
|
|
|
[💻 GitHub Code](https://github.com/aigc-apps/VideoX-Fun)
|
|
|
[English](./README_en.md) | [简体中文](./README.md) |
|
|
|
# Table of Contents |
|
- [Table of Contents](#table-of-contents) |
|
- [Model zoo](#model-zoo) |
|
- [Video Result](#video-result) |
|
- [Quick Start](#quick-start) |
|
- [How to use](#how-to-use) |
|
- [Reference](#reference) |
|
- [License](#license) |
|
|
|
# Model zoo |
|
V1.0: |
|
| Name | Storage Space | Hugging Face | ModelScope | Description |
|
|--|--|--|--|--| |
|
| Wan2.1-Fun-1.3B-InP | 19.0 GB | [🤗Link](https://huggingface.co/alibaba-pai/Wan2.1-Fun-1.3B-InP) | [😄Link](https://modelscope.cn/models/PAI/Wan2.1-Fun-1.3B-InP) | Wan2.1-Fun-1.3B text-to-video weights, trained at multiple resolutions, supporting start and end frame prediction. | |
|
| Wan2.1-Fun-14B-InP | 47.0 GB | [🤗Link](https://huggingface.co/alibaba-pai/Wan2.1-Fun-14B-InP) | [😄Link](https://modelscope.cn/models/PAI/Wan2.1-Fun-14B-InP) | Wan2.1-Fun-14B text-to-video weights, trained at multiple resolutions, supporting start and end frame prediction. | |
|
| Wan2.1-Fun-1.3B-Control | 19.0 GB | [🤗Link](https://huggingface.co/alibaba-pai/Wan2.1-Fun-1.3B-Control) | [😄Link](https://modelscope.cn/models/PAI/Wan2.1-Fun-1.3B-Control) | Wan2.1-Fun-1.3B video control weights, supporting various control conditions such as Canny, Depth, Pose, MLSD, etc., and trajectory control. Supports multi-resolution (512, 768, 1024) video prediction at 81 frames, trained at 16 frames per second, with multilingual prediction support. | |
|
| Wan2.1-Fun-14B-Control | 47.0 GB | [🤗Link](https://huggingface.co/alibaba-pai/Wan2.1-Fun-14B-Control) | [😄Link](https://modelscope.cn/models/PAI/Wan2.1-Fun-14B-Control) | Wan2.1-Fun-14B video control weights, supporting various control conditions such as Canny, Depth, Pose, MLSD, etc., and trajectory control. Supports multi-resolution (512, 768, 1024) video prediction at 81 frames, trained at 16 frames per second, with multilingual prediction support. | |
|
|
|
# Video Result |
|
|
|
### Wan2.1-Fun-14B-InP && Wan2.1-Fun-1.3B-InP |
|
|
|
<table border="0" style="width: 100%; text-align: left; margin-top: 20px;"> |
|
<tr> |
|
<td> |
|
<video src="https://github.com/user-attachments/assets/bd72a276-e60e-4b5d-86c1-d0f67e7425b9" width="100%" controls autoplay loop></video> |
|
</td> |
|
<td> |
|
<video src="https://github.com/user-attachments/assets/cb7aef09-52c2-4973-80b4-b2fb63425044" width="100%" controls autoplay loop></video> |
|
</td> |
|
<td> |
|
<video src="https://github.com/user-attachments/assets/4e10d491-f1cf-4b08-a7c5-1e01e5418140" width="100%" controls autoplay loop></video> |
|
</td> |
|
<td> |
|
<video src="https://github.com/user-attachments/assets/f7e363a9-be09-4b72-bccf-cce9c9ebeb9b" width="100%" controls autoplay loop></video> |
|
</td> |
|
</tr> |
|
</table> |
|
|
|
<table border="0" style="width: 100%; text-align: left; margin-top: 20px;"> |
|
<tr> |
|
<td> |
|
<video src="https://github.com/user-attachments/assets/28f3e720-8acc-4f22-a5d0-ec1c571e9466" width="100%" controls autoplay loop></video> |
|
</td> |
|
<td> |
|
<video src="https://github.com/user-attachments/assets/fb6e4cb9-270d-47cd-8501-caf8f3e91b5c" width="100%" controls autoplay loop></video> |
|
</td> |
|
<td> |
|
<video src="https://github.com/user-attachments/assets/989a4644-e33b-4f0c-b68e-2ff6ba37ac7e" width="100%" controls autoplay loop></video> |
|
</td> |
|
<td> |
|
<video src="https://github.com/user-attachments/assets/9c604fa7-8657-49d1-8066-b5bb198b28b6" width="100%" controls autoplay loop></video> |
|
</td> |
|
</tr> |
|
</table> |
|
|
|
### Wan2.1-Fun-14B-Control && Wan2.1-Fun-1.3B-Control |
|
|
|
<table border="0" style="width: 100%; text-align: left; margin-top: 20px;"> |
|
<tr> |
|
<td> |
|
<video src="https://github.com/user-attachments/assets/f35602c4-9f0a-4105-9762-1e3a88abbac6" width="100%" controls autoplay loop></video> |
|
</td> |
|
<td> |
|
<video src="https://github.com/user-attachments/assets/8b0f0e87-f1be-4915-bb35-2d53c852333e" width="100%" controls autoplay loop></video> |
|
</td> |
|
<td> |
|
<video src="https://github.com/user-attachments/assets/972012c1-772b-427a-bce6-ba8b39edcfad" width="100%" controls autoplay loop></video> |
|
</td> |
|
  </tr>
|
</table> |
|
|
|
<table border="0" style="width: 100%; text-align: left; margin-top: 20px;"> |
|
<tr> |
|
<td> |
|
<video src="https://github.com/user-attachments/assets/53002ce2-dd18-4d4f-8135-b6f68364cabd" width="100%" controls autoplay loop></video> |
|
</td> |
|
<td> |
|
<video src="https://github.com/user-attachments/assets/a1a07cf8-d86d-4cd2-831f-18a6c1ceee1d" width="100%" controls autoplay loop></video> |
|
</td> |
|
<td> |
|
<video src="https://github.com/user-attachments/assets/3224804f-342d-4947-918d-d9fec8e3d273" width="100%" controls autoplay loop></video> |
|
</td> |
|
  </tr>
  <tr>
|
<td> |
|
<video src="https://github.com/user-attachments/assets/c6c5d557-9772-483e-ae47-863d8a26db4a" width="100%" controls autoplay loop></video> |
|
</td> |
|
<td> |
|
<video src="https://github.com/user-attachments/assets/af617971-597c-4be4-beb5-f9e8aaca2d14" width="100%" controls autoplay loop></video> |
|
</td> |
|
<td> |
|
<video src="https://github.com/user-attachments/assets/8411151e-f491-4264-8368-7fc3c5a6992b" width="100%" controls autoplay loop></video> |
|
</td> |
|
</tr> |
|
</table> |
|
|
|
# Quick Start |
|
### 1. Cloud usage: AliyunDSW/Docker |
|
#### a. From AliyunDSW |
|
DSW offers free GPU time that each user can claim once; it remains valid for 3 months after claiming.
|
|
|
Aliyun provides free GPU time through [Freetier](https://free.aliyun.com/?product=9602825&crowd=enterprise&spm=5176.28055625.J_5831864660.1.e939154aRgha4e&scm=20140722.M_9974135.P_110.MO_1806-ID_9974135-MID_9974135-CID_30683-ST_8512-V_1). Claim it and use it in Aliyun PAI-DSW to start CogVideoX-Fun within 5 minutes!
|
|
|
[PAI-DSW Notebook Gallery](https://gallery.pai-ml.com/#/preview/deepLearning/cv/cogvideox_fun)
|
|
|
#### b. From ComfyUI |
|
Our ComfyUI workflow is shown below; please refer to the [ComfyUI README](comfyui/README.md) for details.
|
 |
|
|
|
#### c. From docker |
|
If you are using Docker, please make sure that the GPU driver and CUDA environment are installed correctly on your machine.
|
|
|
Then execute the following commands:
|
|
|
``` |
|
# pull image |
|
docker pull mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easycv/torch_cuda:cogvideox_fun |
|
|
|
# enter image |
|
docker run -it -p 7860:7860 --network host --gpus all --security-opt seccomp:unconfined --shm-size 200g mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easycv/torch_cuda:cogvideox_fun |
|
|
|
# clone code |
|
git clone https://github.com/aigc-apps/CogVideoX-Fun.git |
|
|
|
# enter CogVideoX-Fun's dir |
|
cd CogVideoX-Fun |
|
|
|
# download weights |
|
mkdir models/Diffusion_Transformer |
|
mkdir models/Personalized_Model |
|
|
|
# Please use the Hugging Face link or ModelScope link to download the model.
|
# CogVideoX-Fun |
|
# https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-5b-InP |
|
# https://modelscope.cn/models/PAI/CogVideoX-Fun-V1.1-5b-InP |
|
|
|
# Wan |
|
# https://huggingface.co/alibaba-pai/Wan2.1-Fun-14B-InP |
|
# https://modelscope.cn/models/PAI/Wan2.1-Fun-14B-InP |
|
``` |
|
|
|
### 2. Local install: Environment Check/Downloading/Installation |
|
#### a. Environment Check |
|
We have verified that this repo runs in the following environments:
|
|
|
Details for Windows:
|
- OS: Windows 10 |
|
- python: python3.10 & python3.11 |
|
- pytorch: torch2.2.0 |
|
- CUDA: 11.8 & 12.1 |
|
- CUDNN: 8+ |
|
- GPU: Nvidia-3060 12G & Nvidia-3090 24G |
|
|
|
Details for Linux:
|
- OS: Ubuntu 20.04, CentOS |
|
- python: python3.10 & python3.11 |
|
- pytorch: torch2.2.0 |
|
- CUDA: 11.8 & 12.1 |
|
- CUDNN: 8+ |
|
- GPU: Nvidia-V100 16G & Nvidia-A10 24G & Nvidia-A100 40G & Nvidia-A100 80G
|
|
|
About 60 GB of free disk space is required to store the weights; please check before downloading.
|
|
|
#### b. Weights |
|
It is recommended to place the [weights](#model-zoo) in the following directory layout (a download sketch follows the tree):
|
|
|
``` |
|
📦 models/ |
|
├── 📂 Diffusion_Transformer/ |
|
│ ├── 📂 CogVideoX-Fun-V1.1-2b-InP/ |
|
│ ├── 📂 CogVideoX-Fun-V1.1-5b-InP/ |
|
│ ├── 📂 Wan2.1-Fun-14B-InP/
|
│ └── 📂 Wan2.1-Fun-1.3B-InP/ |
|
├── 📂 Personalized_Model/ |
|
│ └── your trained transformer model / your trained lora model (for UI load)
|
``` |
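
If you prefer to script the download rather than fetch the weights manually, a minimal sketch with `huggingface_hub` (an assumed dependency, installable via `pip install huggingface_hub`) could look like the following; the repo IDs come from the [Model zoo](#model-zoo) table above.

```python
# Minimal download sketch mirroring the directory layout above.
# Swap repo_id / local_dir for other checkpoints from the Model zoo.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="alibaba-pai/Wan2.1-Fun-1.3B-InP",
    local_dir="models/Diffusion_Transformer/Wan2.1-Fun-1.3B-InP",
)
```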
|
|
|
# How to Use |
|
|
|
<h3 id="video-gen">1. Generation</h3> |
|
|
|
#### a. GPU Memory Optimization |
|
Since Wan2.1 has a very large number of parameters, memory-optimization strategies are needed to fit consumer-grade GPUs. Each prediction file provides a `GPU_memory_mode` option, allowing you to choose between `model_cpu_offload`, `model_cpu_offload_and_qfloat8`, and `sequential_cpu_offload`. The same options also apply to CogVideoX-Fun generation.
|
|
|
- `model_cpu_offload`: The entire model is moved to the CPU after use, saving some GPU memory. |
|
- `model_cpu_offload_and_qfloat8`: The entire model is moved to the CPU after use, and the transformer model is quantized to float8, saving more GPU memory. |
|
- `sequential_cpu_offload`: Each layer of the model is moved to the CPU after use. It is slower but saves a significant amount of GPU memory. |
|
|
|
`qfloat8` may slightly reduce model performance but saves more GPU memory. If you have sufficient GPU memory, it is recommended to use `model_cpu_offload`. |
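
A rough sketch of the first and last modes using the standard diffusers offloading calls is shown below. This is not the repository's exact code: it assumes your checkpoint loads through a regular diffusers pipeline, and the local path is the one from the weights layout above.

```python
# Sketch only: the prediction scripts expose this choice via `GPU_memory_mode`;
# their actual loading code may differ from this example.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "models/Diffusion_Transformer/Wan2.1-Fun-1.3B-InP",  # assumed local path
    torch_dtype=torch.bfloat16,
)

# model_cpu_offload: whole sub-models are moved back to the CPU when idle.
# Moderate memory savings with a small speed cost.
pipe.enable_model_cpu_offload()

# sequential_cpu_offload: weights are moved layer by layer instead.
# Maximum memory savings, but much slower.
# pipe.enable_sequential_cpu_offload()
```

The `model_cpu_offload_and_qfloat8` mode additionally quantizes the transformer to float8 before offloading; that step is handled inside this repository's scripts and is not shown here.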
|
|
|
#### b. Using ComfyUI |
|
For details, refer to [ComfyUI README](comfyui/README.md). |
|
|
|
#### c. Running Python Files |
|
- **Step 1**: Download the corresponding [weights](#model-zoo) and place them in the `models` folder. |
|
- **Step 2**: Use different files for prediction depending on the weights and your prediction goal. This library currently supports CogVideoX-Fun, Wan2.1, and Wan2.1-Fun; the models are distinguished by folder names under the `examples` folder, and their supported features vary, so use them accordingly. Below is an example using CogVideoX-Fun (an illustrative settings sketch follows this list):
|
- **Text-to-Video**: |
|
- Modify `prompt`, `neg_prompt`, `guidance_scale`, and `seed` in the file `examples/cogvideox_fun/predict_t2v.py`. |
|
- Run the file `examples/cogvideox_fun/predict_t2v.py` and wait for the results. The generated videos will be saved in the folder `samples/cogvideox-fun-videos`. |
|
- **Image-to-Video**: |
|
- Modify `validation_image_start`, `validation_image_end`, `prompt`, `neg_prompt`, `guidance_scale`, and `seed` in the file `examples/cogvideox_fun/predict_i2v.py`. |
|
- `validation_image_start` is the starting image of the video, and `validation_image_end` is the ending image of the video. |
|
- Run the file `examples/cogvideox_fun/predict_i2v.py` and wait for the results. The generated videos will be saved in the folder `samples/cogvideox-fun-videos_i2v`. |
|
- **Video-to-Video**: |
|
- Modify `validation_video`, `validation_image_end`, `prompt`, `neg_prompt`, `guidance_scale`, and `seed` in the file `examples/cogvideox_fun/predict_v2v.py`. |
|
- `validation_video` is the reference video for video-to-video generation. You can use the following demo video: [Demo Video](https://pai-aigc-photog.oss-cn-hangzhou.aliyuncs.com/cogvideox_fun/asset/v1/play_guitar.mp4). |
|
- Run the file `examples/cogvideox_fun/predict_v2v.py` and wait for the results. The generated videos will be saved in the folder `samples/cogvideox-fun-videos_v2v`. |
|
- **Controlled Video Generation (Canny, Pose, Depth, etc.)**: |
|
- Modify `control_video`, `validation_image_end`, `prompt`, `neg_prompt`, `guidance_scale`, and `seed` in the file `examples/cogvideox_fun/predict_v2v_control.py`. |
|
- `control_video` is the control video extracted using operators such as Canny, Pose, or Depth. You can use the following demo video: [Demo Video](https://pai-aigc-photog.oss-cn-hangzhou.aliyuncs.com/cogvideox_fun/asset/v1.1/pose.mp4). |
|
- Run the file `examples/cogvideox_fun/predict_v2v_control.py` and wait for the results. The generated videos will be saved in the folder `samples/cogvideox-fun-videos_v2v_control`. |
|
- **Step 3**: If you want to integrate other backbones or Loras trained by yourself, modify `lora_path` and relevant paths in `examples/{model_name}/predict_t2v.py` or `examples/{model_name}/predict_i2v.py` as needed. |
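
For the text-to-video example above, editing the script typically amounts to changing a handful of module-level settings near the top of `examples/cogvideox_fun/predict_t2v.py` and then running it. The variable names below are illustrative only and may differ from the actual file:

```python
# Illustrative settings block; check examples/cogvideox_fun/predict_t2v.py for the real names.
GPU_memory_mode = "model_cpu_offload"   # or "model_cpu_offload_and_qfloat8" / "sequential_cpu_offload"
prompt          = "A panda strumming a guitar on a sunlit wooden stage."
neg_prompt      = "blurry, low quality, watermark, distorted hands"
guidance_scale  = 6.0
seed            = 43
```

After editing, run `python examples/cogvideox_fun/predict_t2v.py`; the generated videos land in `samples/cogvideox-fun-videos` as noted above.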
|
|
|
#### d. Using the Web UI |
|
The web UI supports text-to-video, image-to-video, video-to-video, and controlled video generation (Canny, Pose, Depth, etc.). This library currently supports CogVideoX-Fun, Wan2.1, and Wan2.1-Fun. Different models are distinguished by folder names under the `examples` folder, and their supported features vary. Use them accordingly. Below is an example using CogVideoX-Fun: |
|
|
|
- **Step 1**: Download the corresponding [weights](#model-zoo) and place them in the `models` folder. |
|
- **Step 2**: Run the file `examples/cogvideox_fun/app.py` to access the Gradio interface. |
|
- **Step 3**: Select the generation model on the page, fill in `prompt`, `neg_prompt`, `guidance_scale`, and `seed`, click "Generate," and wait for the results. The generated videos will be saved in the `sample` folder. |
|
|
|
# Reference |
|
- CogVideo: https://github.com/THUDM/CogVideo/ |
|
- EasyAnimate: https://github.com/aigc-apps/EasyAnimate |
|
- Wan2.1: https://github.com/Wan-Video/Wan2.1/ |
|
|
|
# License |
|
This project is licensed under the [Apache License (Version 2.0)](https://github.com/modelscope/modelscope/blob/master/LICENSE). |