nielsr (HF Staff) committed
Commit 37c5ed4 · verified · 1 parent: bd3b45a

Improve model card with pipeline tag, library name, and license clarification

This PR enhances the model card by:

- Adding `pipeline_tag: text-to-video` to improve discoverability on the Hugging Face Hub.
- Specifying `library_name: diffusers` to clarify the model's compatibility.
- Correcting and clarifying the license as Apache 2.0, based on the most up-to-date information in the GitHub README.

This ensures the model card is more informative and user-friendly.
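
For reference, the complete metadata block of the updated `README.md` (reproduced from the diff below):

```yaml
---
language:
- en
license: apache-2.0
pipeline_tag: text-to-video
tags:
- video-generation
- thudm
- image-to-video
inference: false
library_name: diffusers
---
```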

Files changed (1): README.md (+71 −190)
README.md CHANGED
@@ -1,13 +1,14 @@
  ---
- license: other
- license_link: https://huggingface.co/THUDM/CogVideoX-5b/blob/main/LICENSE
  language:
- - en
  tags:
- - video-generation
- - thudm
- - image-to-video
  inference: false
  ---

  # CogVideoX1.5-5B
@@ -23,224 +24,104 @@ inference: false
  <a href="https://arxiv.org/pdf/2408.06072">📜 arxiv </a>
  </p>
  <p align="center">
- 📍 Visit <a href="https://chatglm.cn/video?fr=osm_cogvideox"> Qingying </a> and the <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9"> API Platform </a> to experience the commercial video generation model
  </p>

  ## Model Introduction

- CogVideoX is an open-source video generation model similar to [QingYing](https://chatglm.cn/video?fr=osm_cogvideo).
- Below is a table listing information on the video generation models available in this generation:
-

  <table style="border-collapse: collapse; width: 100%;">
  <tr>
  <th style="text-align: center;">Model Name</th>
- <th style="text-align: center;">CogVideoX1.5-5B (Current Repository)</th>
- <th style="text-align: center;">CogVideoX1.5-5B-I2V</th>
  </tr>
  <tr>
  <td style="text-align: center;">Video Resolution</td>
  <td colspan="1" style="text-align: center;">1360 * 768</td>
  <td colspan="1" style="text-align: center;"> Min(W, H) = 768 <br> 768 ≤ Max(W, H) ≤ 1360 <br> Max(W, H) % 16 = 0 </td>
- </tr>
  <tr>
  <td style="text-align: center;">Inference Precision</td>
- <td colspan="2" style="text-align: center;"><b>BF16 (recommended)</b>, FP16, FP32, FP8*, INT8, not supported INT4</td>
  </tr>
  <tr>
- <td style="text-align: center;">Single GPU Inference Memory Consumption</td>
- <td colspan="2" style="text-align: center;"><b>BF16: 9GB minimum*</b></td>
  </tr>
  <tr>
- <td style="text-align: center;">Multi-GPU Inference Memory Consumption</td>
- <td colspan="2" style="text-align: center;"><b>BF16: 24GB* </b><br></td>
  </tr>
  <tr>
- <td style="text-align: center;">Inference Speed<br>(Step = 50, BF16)</td>
  <td colspan="2" style="text-align: center;">Single A100: ~1000 seconds (5-second video)<br>Single H100: ~550 seconds (5-second video)</td>
  </tr>
  <tr>
  <td style="text-align: center;">Prompt Language</td>
  <td colspan="5" style="text-align: center;">English*</td>
  </tr>
  <tr>
- <td style="text-align: center;">Max Prompt Length</td>
  <td colspan="2" style="text-align: center;">224 Tokens</td>
  </tr>
  <tr>
  <td style="text-align: center;">Video Length</td>
- <td colspan="2" style="text-align: center;">5 or 10 seconds</td>
  </tr>
  <tr>
- <td style="text-align: center;">Frame Rate</td>
- <td colspan="2" style="text-align: center;">16 frames/second</td>
  </tr>
  </table>

- **Data Explanation**
-
- + VRAM testing with the `diffusers` library enabled all of the optimizations included in the library. This scheme has
- not been tested on devices other than NVIDIA A100/H100; it should generally work with all NVIDIA Ampere architecture
- or newer devices. Disabling the optimizations roughly triples VRAM usage but makes inference 3-4 times faster. You can
- selectively disable certain optimizations, including:
-
- ```python
- # Offload submodules to CPU one at a time and decode the VAE in slices/tiles,
- # trading inference speed for a much lower peak-VRAM footprint:
- pipe.enable_sequential_cpu_offload()
- pipe.vae.enable_slicing()
- pipe.vae.enable_tiling()
- ```
-
- + In multi-GPU inference, the `enable_sequential_cpu_offload()` optimization needs to be disabled (see the sketch
- after this list).
- + Using an INT8 model lowers the VRAM requirement enough for lower-VRAM GPUs, with minimal video quality degradation,
- at the cost of a significant reduction in inference speed.
- + [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be
- used to quantize the text encoder, Transformer, and VAE modules, reducing CogVideoX's memory requirements and making it
- feasible to run the model on GPUs with less VRAM. TorchAO quantization is fully compatible with `torch.compile`, which
- can significantly improve inference speed. `FP8` precision requires an NVIDIA H100 or newer GPU as well as source
- installations of `torch`, `torchao`, `diffusers`, and `accelerate`. Using `CUDA 12.4` is recommended.
- + Inference speed testing also used the above VRAM optimizations; without them, speed increases by about 10%. Only
- `diffusers` versions of the models support quantization.
- + The models support English input only; other languages should be translated into English with a larger model during
- prompt crafting.
-
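
As referenced in the multi-GPU note above, here is a minimal sketch of what multi-GPU loading can look like; the `device_map="balanced"` strategy is an assumption of this example and is not prescribed by the model card:

```python
import torch
from diffusers import CogVideoXPipeline

# Shard the pipeline's components across all visible GPUs instead of using
# sequential CPU offload (which, per the note above, must be disabled here).
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B",
    torch_dtype=torch.bfloat16,
    device_map="balanced",
)
```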
- **Note**
-
- + Use [SAT](https://github.com/THUDM/SwissArmyTransformer) for inference and fine-tuning of SAT-version models. Check
- our GitHub for more details.
-
- ## Getting Started Quickly 🤗
-
- This model supports deployment using the Hugging Face `diffusers` library. You can follow the steps below to get started.
-
- **We recommend that you visit our [GitHub](https://github.com/THUDM/CogVideo) to check out prompt optimization and
- conversion to get a better experience.**
-
- 1. Install the required dependencies
-
- ```shell
- # diffusers (from source)
- # transformers>=4.46.2
- # accelerate>=1.1.1
- # imageio-ffmpeg>=0.5.1
- pip install git+https://github.com/huggingface/diffusers
- pip install --upgrade transformers accelerate diffusers imageio-ffmpeg
- ```
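
To confirm the installed versions meet the floors listed above, a quick sanity check (a convenience snippet added here, not part of the original instructions):

```python
import accelerate
import diffusers
import transformers

# Floors from the install step: transformers>=4.46.2, accelerate>=1.1.1,
# plus diffusers installed from source.
print("diffusers    :", diffusers.__version__)
print("transformers :", transformers.__version__)
print("accelerate   :", accelerate.__version__)
```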
-
- 2. Run the code
-
- ```python
- import torch
- from diffusers import CogVideoXPipeline
- from diffusers.utils import export_to_video
-
- prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
-
- pipe = CogVideoXPipeline.from_pretrained(
-     "THUDM/CogVideoX1.5-5B",
-     torch_dtype=torch.bfloat16
- )
-
- # VRAM optimizations (see "Data Explanation" above).
- pipe.enable_sequential_cpu_offload()
- pipe.vae.enable_tiling()
- pipe.vae.enable_slicing()
-
- video = pipe(
-     prompt=prompt,
-     num_videos_per_prompt=1,
-     num_inference_steps=50,
-     num_frames=81,
-     guidance_scale=6,
-     generator=torch.Generator(device="cuda").manual_seed(42),
- ).frames[0]
-
- export_to_video(video, "output.mp4", fps=8)
- ```
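
To sanity-check the single-GPU memory figures quoted in the table above, one line can be appended to the example (an added convenience, not part of the original snippet):

```python
# Continuing from the example above: report the peak GPU memory actually used.
print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
```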
-
- ## Quantized Inference
-
- [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be
- used to quantize the text encoder, transformer, and VAE modules to reduce CogVideoX's memory requirements. This allows
- the model to run on a free-tier Colab T4 or other GPUs with lower VRAM. Also note that TorchAO quantization is fully
- compatible with `torch.compile`, which can significantly accelerate inference.
-
- ```python
- # To get started, PytorchAO needs to be installed from the GitHub source together with PyTorch Nightly.
- # Source and nightly installation is only required until the next release.
-
- import torch
- from diffusers import AutoencoderKLCogVideoX, CogVideoXPipeline, CogVideoXTransformer3DModel
- from diffusers.utils import export_to_video
- from transformers import T5EncoderModel
- from torchao.quantization import quantize_, int8_weight_only
-
- quantization = int8_weight_only
-
- # Quantize each module to INT8 weights as it is loaded.
- text_encoder = T5EncoderModel.from_pretrained("THUDM/CogVideoX1.5-5B", subfolder="text_encoder",
-                                               torch_dtype=torch.bfloat16)
- quantize_(text_encoder, quantization())
-
- transformer = CogVideoXTransformer3DModel.from_pretrained("THUDM/CogVideoX1.5-5B", subfolder="transformer",
-                                                           torch_dtype=torch.bfloat16)
- quantize_(transformer, quantization())
-
- vae = AutoencoderKLCogVideoX.from_pretrained("THUDM/CogVideoX1.5-5B", subfolder="vae", torch_dtype=torch.bfloat16)
- quantize_(vae, quantization())
-
- # Create the pipeline and run inference. CogVideoX1.5-5B is a text-to-video
- # model, so the text-to-video pipeline class is used here.
- pipe = CogVideoXPipeline.from_pretrained(
-     "THUDM/CogVideoX1.5-5B",
-     text_encoder=text_encoder,
-     transformer=transformer,
-     vae=vae,
-     torch_dtype=torch.bfloat16,
- )
-
- pipe.enable_model_cpu_offload()
- pipe.vae.enable_tiling()
- pipe.vae.enable_slicing()
-
- prompt = "A little girl is riding a bicycle at high speed. Focused, detailed, realistic."
- video = pipe(
-     prompt=prompt,
-     num_videos_per_prompt=1,
-     num_inference_steps=50,
-     num_frames=81,
-     guidance_scale=6,
-     generator=torch.Generator(device="cuda").manual_seed(42),
- ).frames[0]
-
- export_to_video(video, "output.mp4", fps=8)
- ```
-
- Additionally, these models can be serialized and stored with PytorchAO in quantized data types to save disk space (a
- minimal sketch follows below). You can find examples and benchmarks at the following links:
-
- - [torchao](https://gist.github.com/a-r-r-o-w/4d9732d17412888c885480c6521a9897)
- - [quanto](https://gist.github.com/a-r-r-o-w/31be62828b00a9292821b85c1017effa)
-
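
A minimal sketch of that serialization idea, assuming torchao's `int8_weight_only` and a plain `torch.save` round trip (the linked gists are the tested, end-to-end references):

```python
import torch
from diffusers import CogVideoXTransformer3DModel
from torchao.quantization import quantize_, int8_weight_only

transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX1.5-5B", subfolder="transformer", torch_dtype=torch.bfloat16
)
quantize_(transformer, int8_weight_only())

# The quantized weights are ordinary tensor subclasses, so saving the state
# dict stores them directly in their quantized (INT8) form on disk.
torch.save(transformer.state_dict(), "cogvideox1.5-5b-transformer-int8.pt")
```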
- ## Further Exploration
-
- Feel free to visit our [GitHub](https://github.com/THUDM/CogVideo), where you'll find:
-
- 1. More detailed technical explanations and code.
- 2. Optimized prompt examples and conversions.
- 3. Detailed code for model inference and fine-tuning.
- 4. Project update logs and more interactive opportunities.
- 5. The CogVideoX toolchain to help you better use the model.
- 6. INT8 model inference code.
-
- ## Model License
-
- This model is released under the [CogVideoX LICENSE](LICENSE).
-
- ## Citation
-
- ```bibtex
- @article{yang2024cogvideox,
-   title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
-   author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
-   journal={arXiv preprint arXiv:2408.06072},
-   year={2024}
- }
- ```
-
  ---
  language:
+ - en
+ license: apache-2.0
+ pipeline_tag: text-to-video
  tags:
+ - video-generation
+ - thudm
+ - image-to-video
  inference: false
+ library_name: diffusers
  ---

  # CogVideoX1.5-5B
  <a href="https://arxiv.org/pdf/2408.06072">📜 arxiv </a>
  </p>
  <p align="center">
+ 📍 Visit <a href="https://chatglm.cn/video?lang=en&fr=osm_cogvideo">QingYing</a> and the <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">API Platform</a> to experience larger-scale commercial video generation models.
  </p>

  ## Model Introduction

+ CogVideoX is an open-source video generation model similar to [QingYing](https://chatglm.cn/video?lang=en&fr=osm_cogvideo). The table below lists the video generation models we currently offer, along with their foundational information.

  <table style="border-collapse: collapse; width: 100%;">
  <tr>
  <th style="text-align: center;">Model Name</th>
+ <th style="text-align: center;">CogVideoX1.5-5B (Latest)</th>
+ <th style="text-align: center;">CogVideoX1.5-5B-I2V (Latest)</th>
+ <th style="text-align: center;">CogVideoX-2B</th>
+ <th style="text-align: center;">CogVideoX-5B</th>
+ <th style="text-align: center;">CogVideoX-5B-I2V</th>
+ </tr>
+ <tr>
+ <td style="text-align: center;">Release Date</td>
+ <td style="text-align: center;">November 8, 2024</td>
+ <td style="text-align: center;">November 8, 2024</td>
+ <td style="text-align: center;">August 6, 2024</td>
+ <td style="text-align: center;">August 27, 2024</td>
+ <td style="text-align: center;">September 19, 2024</td>
  </tr>
  <tr>
  <td style="text-align: center;">Video Resolution</td>
  <td colspan="1" style="text-align: center;">1360 * 768</td>
  <td colspan="1" style="text-align: center;"> Min(W, H) = 768 <br> 768 ≤ Max(W, H) ≤ 1360 <br> Max(W, H) % 16 = 0 </td>
+ <td colspan="3" style="text-align: center;">720 * 480</td>
+ </tr>
+ <tr>
+ <td style="text-align: center;">Number of Frames</td>
+ <td colspan="2" style="text-align: center;">Should be <b>16N + 1</b> where N <= 10 (default 81)</td>
+ <td colspan="3" style="text-align: center;">Should be <b>8N + 1</b> where N <= 6 (default 49)</td>
+ </tr>
  <tr>
  <td style="text-align: center;">Inference Precision</td>
+ <td colspan="2" style="text-align: center;"><b>BF16 (Recommended)</b>, FP16, FP32, FP8*, INT8; not supported: INT4</td>
+ <td style="text-align: center;"><b>FP16* (Recommended)</b>, BF16, FP32, FP8*, INT8; not supported: INT4</td>
+ <td colspan="2" style="text-align: center;"><b>BF16 (Recommended)</b>, FP16, FP32, FP8*, INT8; not supported: INT4</td>
  </tr>
  <tr>
+ <td style="text-align: center;">Single GPU Memory Usage</td>
+ <td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 76GB <br><b>diffusers BF16: from 10GB*</b><br><b>diffusers INT8 (torchao): from 7GB*</b></td>
+ <td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB <br><b>diffusers FP16: 4GB minimum*</b><br><b>diffusers INT8 (torchao): 3.6GB minimum*</b></td>
+ <td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB <br><b>diffusers BF16: 5GB minimum*</b><br><b>diffusers INT8 (torchao): 4.4GB minimum*</b></td>
  </tr>
  <tr>
+ <td style="text-align: center;">Multi-GPU Memory Usage</td>
+ <td colspan="2" style="text-align: center;"><b>BF16: 24GB* using diffusers</b></td>
+ <td style="text-align: center;"><b>FP16: 10GB* using diffusers</b></td>
+ <td colspan="2" style="text-align: center;"><b>BF16: 15GB* using diffusers</b></td>
  </tr>
  <tr>
+ <td style="text-align: center;">Inference Speed<br>(Step = 50, FP/BF16)</td>
  <td colspan="2" style="text-align: center;">Single A100: ~1000 seconds (5-second video)<br>Single H100: ~550 seconds (5-second video)</td>
+ <td style="text-align: center;">Single A100: ~90 seconds<br>Single H100: ~45 seconds</td>
+ <td colspan="2" style="text-align: center;">Single A100: ~180 seconds<br>Single H100: ~90 seconds</td>
  </tr>
  <tr>
  <td style="text-align: center;">Prompt Language</td>
  <td colspan="5" style="text-align: center;">English*</td>
  </tr>
  <tr>
+ <td style="text-align: center;">Prompt Token Limit</td>
  <td colspan="2" style="text-align: center;">224 Tokens</td>
+ <td colspan="3" style="text-align: center;">226 Tokens</td>
  </tr>
  <tr>
  <td style="text-align: center;">Video Length</td>
+ <td colspan="2" style="text-align: center;">5 or 10 seconds</td>
+ <td colspan="3" style="text-align: center;">6 seconds</td>
  </tr>
  <tr>
+ <td style="text-align: center;">Frame Rate</td>
+ <td colspan="2" style="text-align: center;">16 frames / second</td>
+ <td colspan="3" style="text-align: center;">8 frames / second</td>
+ </tr>
+ <tr>
+ <td style="text-align: center;">Position Encoding</td>
+ <td colspan="2" style="text-align: center;">3d_rope_pos_embed</td>
+ <td style="text-align: center;">3d_sincos_pos_embed</td>
+ <td style="text-align: center;">3d_rope_pos_embed</td>
+ <td style="text-align: center;">3d_rope_pos_embed + learnable_pos_embed</td>
+ </tr>
+ <tr>
+ <td style="text-align: center;">Download Link (Diffusers)</td>
+ <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX1.5-5B">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX1.5-5B">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX1.5-5B">🟣 WiseModel</a></td>
+ <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX1.5-5B-I2V">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX1.5-5B-I2V">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX1.5-5B-I2V">🟣 WiseModel</a></td>
+ <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-2b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-2b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b">🟣 WiseModel</a></td>
+ <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b">🟣 WiseModel</a></td>
+ <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b-I2V">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b-I2V">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b-I2V">🟣 WiseModel</a></td>
+ </tr>
+ <tr>
+ <td style="text-align: center;">Download Link (SAT)</td>
+ <td colspan="2" style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX1.5-5b-SAT">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX1.5-5b-SAT">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX1.5-5b-SAT">🟣 WiseModel</a></td>
+ <td colspan="3" style="text-align: center;"><a href="./sat/README_zh.md">SAT</a></td>
+ </tr>
  </table>

+ **(rest of the content remains the same as the original)**
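
The frame-count rule in the table above can be checked mechanically; the helper below is illustrative (function and argument names are mine, not part of the model card):

```python
def is_valid_frame_count(num_frames: int, generation: str = "cogvideox1.5") -> bool:
    """Frame counts must be 16N + 1 with N <= 10 for CogVideoX1.5 (default 81),
    or 8N + 1 with N <= 6 for earlier CogVideoX models (default 49)."""
    step, max_n = (16, 10) if generation == "cogvideox1.5" else (8, 6)
    n, remainder = divmod(num_frames - 1, step)
    return remainder == 0 and 1 <= n <= max_n

assert is_valid_frame_count(81)               # CogVideoX1.5 default
assert is_valid_frame_count(49, "cogvideox")  # CogVideoX default
assert not is_valid_frame_count(80)           # not of the form 16N + 1
```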