---
license: apache-2.0
datasets:
- TIGER-Lab/VideoFeedback
language:
- en
metrics:
- accuracy/spcc
library_name: transformers
pipeline_tag: visual-question-answering
---

[📃Paper] | [🌐Website](https://tiger-ai-lab.github.io/MantisScore/) | [💻Github](https://github.com/TIGER-AI-Lab/MantisScore) | [🛢️Datasets](https://huggingface.co/datasets/TIGER-Lab/VideoFeedback) | [🤗Model](https://huggingface.co/TIGER-Lab/MantisScore) | [🤗Model-variant](https://huggingface.co/TIGER-Lab/MantisScore-anno-only) | [🤗Demo](https://huggingface.co/spaces/Mantis-VL/MantisScore)

![MantisScore](https://tiger-ai-lab.github.io/MantisScore/static/images/teaser.png)

## Introduction

- MantisScore-anno-only is a video quality evaluation model that takes [Mantis-8B-Idefics2](https://huggingface.co/TIGER-Lab/Mantis-8B-Idefics2) as its base model and is trained on [VideoFeedback](https://huggingface.co/datasets/TIGER-Lab/VideoFeedback), a large video evaluation dataset with multi-aspect human scores, with the real videos excluded (read more in the VideoFeedback dataset card).

- MantisScore reaches a Spearman correlation above 75 with human ratings on VideoFeedback-test, surpassing all MLLM-prompting methods and feature-based metrics.

- MantisScore also beats the best baselines on three other benchmarks, EvalCrafter, GenAI-Bench and VBench, showing high alignment with human evaluation.

## Evaluation Results

We test our video evaluation model MantisScore on VideoFeedback-test, EvalCrafter, GenAI-Bench and VBench.
For the first two benchmarks, the indicator is the Spearman correlation between the model's output and human ratings,
averaged over all evaluation aspects.
For GenAI-Bench and VBench, which include human preference data over two or more videos,
we use the model's output to predict preferences and take pairwise accuracy as the performance indicator.

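For clarity, here is a minimal sketch of how these two indicators can be computed once per-aspect model scores and human ratings have been collected. The function and variable names below are illustrative only and are not part of the released evaluation code; see the Github repo for the actual scripts.

```python
# Illustrative sketch (assumed data layout, not the released evaluation code):
# `model_scores` and `human_scores` map each evaluation aspect to a list of
# per-video scores on the same benchmark, in the same video order.
import numpy as np
from scipy.stats import spearmanr

def avg_spearman(model_scores: dict, human_scores: dict) -> float:
    """Spearman correlation per aspect, averaged over aspects
    (VideoFeedback-test / EvalCrafter style indicator)."""
    corrs = []
    for aspect in model_scores:
        rho, _ = spearmanr(model_scores[aspect], human_scores[aspect])
        corrs.append(rho)
    return float(np.mean(corrs))

def pairwise_accuracy(overall_scores: list, human_prefs: list) -> float:
    """Pairwise accuracy for preference data (GenAI-Bench / VBench style).

    `human_prefs` holds index pairs (i, j) meaning humans prefer video i over
    video j; a pair counts as correct when the model also scores video i higher.
    """
    correct = sum(overall_scores[i] > overall_scores[j] for i, j in human_prefs)
    return correct / len(human_prefs)
```
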
Moreover, we use [MantisScore](https://huggingface.co/TIGER-Lab/MantisScore), trained on the VideoFeedback dataset,
for the VideoFeedback-test set, while for the other three benchmarks we use the
[MantisScore-anno-only](https://huggingface.co/TIGER-Lab/MantisScore-anno-only) variant, trained on the VideoFeedback dataset
with the real videos excluded.

The evaluation results are shown below:

| metric | Final Sum Score | VideoFeedback-test | EvalCrafter | GenAI-Bench | VBench |
|:-----------------:|:---------------:|:------------------:|:-----------:|:-----------:|:----------:|
| MantisScore (reg) | **278.3** | 75.7 | **51.1** | **78.5** | **73.0** |
| MantisScore (gen) | 222.4 | **77.1** | 27.6 | 59.0 | 58.7 |
| Gemini-1.5-Pro | <u>158.8</u> | 22.1 | 22.9 | 60.9 | 52.9 |
| Gemini-1.5-Flash | 157.5 | 20.8 | 17.3 | <u>67.1</u> | 52.3 |
| GPT-4o | 155.4 | <u>23.1</u> | 28.7 | 52.0 | 51.7 |
| CLIP-sim | 126.8 | 8.9 | <u>36.2</u> | 34.2 | 47.4 |
| DINO-sim | 121.3 | 7.5 | 32.1 | 38.5 | 43.3 |
| SSIM-sim | 118.0 | 13.4 | 26.9 | 34.1 | 43.5 |
| CLIP-Score | 114.4 | -7.2 | 21.7 | 45.0 | 54.9 |
| LLaVA-1.5-7B | 108.3 | 8.5 | 10.5 | 49.9 | 39.4 |
| LLaVA-1.6-7B | 93.3 | -3.1 | 13.2 | 44.5 | 38.7 |
| X-CLIP-Score | 92.9 | -1.9 | 13.3 | 41.4 | 40.1 |
| PIQE | 78.3 | -10.1 | -1.2 | 34.5 | <u>55.1</u> |
| BRISQUE | 75.9 | -20.3 | 3.9 | 38.5 | 53.7 |
| Idefics2 | 73.0 | 6.5 | 0.3 | 34.6 | 31.7 |
| SSIM-dyn | 42.5 | -5.5 | -17.0 | 28.4 | 36.5 |
| MSE-dyn | 36.7 | -12.9 | -26.4 | 31.4 | 44.5 |
| Fuyu | - | - | - | - | - |
| Kosmos-2 | - | - | - | - | - |
| CogVLM | - | - | - | - | - |
| OpenFlamingo | - | - | - | - | - |

The best result in the MantisScore series is shown in bold, and the best among the baselines is underlined.
"-" means the MLLM's answer is meaningless or in the wrong format.

## Usage

### Installation

```bash
pip install git+https://github.com/TIGER-AI-Lab/MantisScore.git
```

### Inference

The snippet below samples frames from a video, builds the regression query prompt, and prints the five aspect scores.

```python
import av
import numpy as np
import torch
from PIL import Image
from transformers import AutoProcessor

# NOTE: the import path of the sequence-classification head is assumed from the
# Mantis codebase installed above; adjust it if your package layout differs.
from mantis.models.idefics2 import Idefics2ForSequenceClassification

def _read_video_pyav(container, indices):
    # decode only the frames whose indices were selected for sampling
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

MAX_NUM_FRAMES = 16
ROUND_DIGIT = 4
REGRESSION_QUERY_PROMPT = """
Suppose you are an expert in judging and evaluating the quality of AI-generated videos,
please watch the following frames of a given video and see the text prompt for generating the video,
then give scores from 5 different dimensions:
(1) visual quality: the quality of the video in terms of clearness, resolution, brightness, and color
(2) temporal consistency, both the consistency of objects or humans and the smoothness of motion or movements
(3) dynamic degree, the degree of dynamic changes
(4) text-to-video alignment, the alignment between the text prompt and the video content
(5) factual consistency, the consistency of the video content with the common-sense and factual knowledge

for each dimension, output a float number from 1.0 to 4.0,
the higher the number is, the better the video performs in that sub-score,
the lowest 1.0 means Bad, the highest 4.0 means Perfect/Real (the video is like a real video)
Here is an output example:
visual quality: 3.2
temporal consistency: 2.7
dynamic degree: 4.0
text-to-video alignment: 2.3
factual consistency: 1.8

For this video, the text prompt is "{text_prompt}",
all the frames of video are as follows:
"""

model_name = "TIGER-Lab/MantisScore-anno-only"
processor = AutoProcessor.from_pretrained(model_name)
model = Idefics2ForSequenceClassification.from_pretrained(model_name).eval()

video_path = "examples/video1.mp4"
video_prompt = "..."  # placeholder: the text prompt that was used to generate the video

# sample MAX_NUM_FRAMES frames uniformly from the video
container = av.open(video_path)
total_frames = container.streams.video[0].frames
if total_frames > MAX_NUM_FRAMES:
    indices = np.arange(0, total_frames, total_frames / MAX_NUM_FRAMES).astype(int)
else:
    indices = np.arange(total_frames)

frames = [Image.fromarray(x) for x in _read_video_pyav(container, indices)]
eval_prompt = REGRESSION_QUERY_PROMPT.format(text_prompt=video_prompt)
num_image_token = eval_prompt.count("<image>")
if num_image_token < len(frames):
    # append one <image> placeholder per sampled frame
    eval_prompt += "<image> " * (len(frames) - num_image_token)

flatten_images = []
for x in [frames]:
    if isinstance(x, list):
        flatten_images.extend(x)
    else:
        flatten_images.append(x)
flatten_images = [Image.open(x) if isinstance(x, str) else x for x in flatten_images]
inputs = processor(text=eval_prompt, images=flatten_images, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
num_aspects = logits.shape[-1]

aspect_scores = []
for i in range(num_aspects):
    aspect_scores.append(round(logits[0, i].item(), ROUND_DIGIT))
print(aspect_scores)

"""
# model output on visual quality, temporal consistency, dynamic degree,
# text-to-video alignment, factual consistency, respectively
[2.2969, 2.4375, 2.8281, 2.5, 2.4688]
"""
```
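If you want the printed list to be self-describing, you can pair it with the aspect names used in the prompt; this small helper is only an illustration and not part of the released code.

```python
# Illustrative only: pair the printed scores with the five aspect names
# (the order matches the comment in the snippet above).
ASPECT_NAMES = [
    "visual quality",
    "temporal consistency",
    "dynamic degree",
    "text-to-video alignment",
    "factual consistency",
]

named_scores = dict(zip(ASPECT_NAMES, aspect_scores))
print(named_scores)
# e.g. {'visual quality': 2.2969, 'temporal consistency': 2.4375, ...}
```
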

### Training
See [MantisScore/training](https://github.com/TIGER-AI-Lab/MantisScore/training) for details.

### Evaluation
See [MantisScore/benchmark](https://github.com/TIGER-AI-Lab/MantisScore/benchmark) for details.

## Citation