hexuan21 committed
Commit 2fe9a47 · verified · 1 Parent(s): 36961a8

Update README.md

Files changed (1):
  1. README.md +167 -53
README.md CHANGED
@@ -1,58 +1,172 @@
  ---
  license: apache-2.0
- base_model: TIGER-Lab/Mantis-8B-Idefics2
- tags:
- - generated_from_trainer
- model-index:
- - name: mantis-8b-idefics2-video-eval-refined-40k-ablation-anno_4096_regression
-   results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # mantis-8b-idefics2-video-eval-refined-40k-ablation-anno_4096_regression
-
- This model is a fine-tuned version of [TIGER-Lab/Mantis-8B-Idefics2](https://huggingface.co/TIGER-Lab/Mantis-8B-Idefics2) on an unknown dataset.
-
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 1e-05
- - train_batch_size: 1
- - eval_batch_size: 1
- - seed: 42
- - distributed_type: multi-GPU
- - num_devices: 8
- - gradient_accumulation_steps: 8
- - total_train_batch_size: 64
- - total_eval_batch_size: 8
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_ratio: 0.03
- - num_epochs: 1.0
-
- ### Training results
-
-
-
- ### Framework versions
-
- - Transformers 4.41.1
- - Pytorch 2.3.0+cu121
- - Datasets 2.18.0
- - Tokenizers 0.19.1
 
  ---
  license: apache-2.0
+ datasets:
+ - TIGER-Lab/VideoFeedback
+ language:
+ - en
+ metrics:
+ - accuracy/spcc
+ library_name: transformers
+ pipeline_tag: visual-question-answering
  ---


+ [📃Paper] | [🌐Website](https://tiger-ai-lab.github.io/MantisScore/) | [💻Github](https://github.com/TIGER-AI-Lab/MantisScore) | [🛢️Datasets](https://huggingface.co/datasets/TIGER-Lab/VideoFeedback) | [🤗Model](https://huggingface.co/TIGER-Lab/MantisScore) | [🤗Model-variant](https://huggingface.co/TIGER-Lab/MantisScore-anno-only) | [🤗Demo](https://huggingface.co/spaces/Mantis-VL/MantisScore)
+
+
+ ![MantisScore](https://tiger-ai-lab.github.io/MantisScore/static/images/teaser.png)
+
+ ## Introduction
+ - MantisScore-anno-only is a video quality evaluation model. It takes [Mantis-8B-Idefics2](https://huggingface.co/TIGER-Lab/Mantis-8B-Idefics2) as the base model
+ and is trained on [VideoFeedback](https://huggingface.co/datasets/TIGER-Lab/VideoFeedback),
+ a large video evaluation dataset with multi-aspect human scores, with the real videos excluded (read more in the VideoFeedback dataset card, and see the loading sketch after this list).
+
+ - MantisScore reaches a Spearman correlation above 75 with human ratings on VideoFeedback-test, surpassing all MLLM-prompting methods and feature-based metrics.
+
+ - MantisScore also beats the best baselines on three other benchmarks, EvalCrafter, GenAI-Bench and VBench, showing high alignment with human evaluation.
+
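+ As a quick way to inspect the training data described above, here is a minimal loading sketch with 🤗 `datasets`; the config name "annotated" is an assumption based on the VideoFeedback dataset card:
+ ```python
+ from datasets import load_dataset
+
+ # the annotated subset (real videos excluded) is what this variant is trained on;
+ # the config name is assumed from the dataset card, adjust it if yours differs
+ train_data = load_dataset("TIGER-Lab/VideoFeedback", name="annotated", split="train")
+ print(train_data)
+ print(train_data[0])  # one sample with its text prompt and multi-aspect human scores
+ ```
+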
+ ## Evaluation Results
+
+ We test our video evaluation model MantisScore on VideoFeedback-test, EvalCrafter, GenAI-Bench and VBench.
+ For the first two benchmarks, we take the Spearman correlation between the model's output and human ratings,
+ averaged over all evaluation aspects, as the indicator.
+ For GenAI-Bench and VBench, which include human preference data between two or more videos,
+ we use the model's output to predict preferences and take pairwise accuracy as the performance indicator
+ (a minimal sketch of both indicators follows below).
+
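+ As a concrete illustration of these two indicators (not the exact benchmark scripts), here is a minimal sketch, assuming hypothetical model/human score arrays and using `numpy` and `scipy`:
+ ```python
+ import numpy as np
+ from scipy.stats import spearmanr
+
+ # hypothetical data: model and human scores for N videos on 5 aspects (shape N x 5)
+ model_scores = np.array([[2.3, 2.4, 2.8, 2.5, 2.5],
+                          [3.1, 3.0, 1.9, 3.2, 3.0],
+                          [1.4, 1.8, 2.2, 1.6, 1.9]])
+ human_scores = np.array([[2.0, 2.5, 3.0, 2.5, 2.0],
+                          [3.5, 3.0, 2.0, 3.0, 3.5],
+                          [1.0, 2.0, 2.5, 1.5, 2.0]])
+
+ # Spearman correlation per aspect, averaged over all aspects
+ # (the indicator used for VideoFeedback-test and EvalCrafter)
+ rhos = [spearmanr(model_scores[:, a], human_scores[:, a])[0]
+         for a in range(model_scores.shape[1])]
+ print("avg Spearman:", np.mean(rhos))
+
+ # pairwise accuracy (the indicator used for GenAI-Bench and VBench):
+ # for every pair of videos, check whether the model's overall score
+ # orders the pair the same way as the human judgment
+ overall_model = model_scores.mean(axis=1)
+ overall_human = human_scores.mean(axis=1)
+ pairs = [(i, j) for i in range(len(overall_model)) for j in range(i + 1, len(overall_model))]
+ correct = sum(np.sign(overall_model[i] - overall_model[j]) ==
+               np.sign(overall_human[i] - overall_human[j]) for i, j in pairs)
+ print("pairwise accuracy:", correct / len(pairs))
+ ```
+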
+ Moreover, we use [MantisScore](https://huggingface.co/TIGER-Lab/MantisScore) trained on the VideoFeedback dataset
+ for the VideoFeedback-test set, while for the other three benchmarks we use the
+ [MantisScore-anno-only](https://huggingface.co/TIGER-Lab/MantisScore-anno-only) variant trained on the VideoFeedback dataset
+ with real videos excluded.
+
+ The evaluation results are shown below:
+
+
+ | metric            | Final Sum Score | VideoFeedback-test | EvalCrafter | GenAI-Bench | VBench      |
+ |:-----------------:|:---------------:|:------------------:|:-----------:|:-----------:|:-----------:|
+ | MantisScore (reg) | **278.3**       | 75.7               | **51.1**    | **78.5**    | **73.0**    |
+ | MantisScore (gen) | 222.4           | **77.1**           | 27.6        | 59.0        | 58.7        |
+ | Gemini-1.5-Pro    | <u>158.8</u>    | 22.1               | 22.9        | 60.9        | 52.9        |
+ | Gemini-1.5-Flash  | 157.5           | 20.8               | 17.3        | <u>67.1</u> | 52.3        |
+ | GPT-4o            | 155.4           | <u>23.1</u>        | 28.7        | 52.0        | 51.7        |
+ | CLIP-sim          | 126.8           | 8.9                | <u>36.2</u> | 34.2        | 47.4        |
+ | DINO-sim          | 121.3           | 7.5                | 32.1        | 38.5        | 43.3        |
+ | SSIM-sim          | 118.0           | 13.4               | 26.9        | 34.1        | 43.5        |
+ | CLIP-Score        | 114.4           | -7.2               | 21.7        | 45.0        | 54.9        |
+ | LLaVA-1.5-7B      | 108.3           | 8.5                | 10.5        | 49.9        | 39.4        |
+ | LLaVA-1.6-7B      | 93.3            | -3.1               | 13.2        | 44.5        | 38.7        |
+ | X-CLIP-Score      | 92.9            | -1.9               | 13.3        | 41.4        | 40.1        |
+ | PIQE              | 78.3            | -10.1              | -1.2        | 34.5        | <u>55.1</u> |
+ | BRISQUE           | 75.9            | -20.3              | 3.9         | 38.5        | 53.7        |
+ | Idefics2          | 73.0            | 6.5                | 0.3         | 34.6        | 31.7        |
+ | SSIM-dyn          | 42.5            | -5.5               | -17.0       | 28.4        | 36.5        |
+ | MSE-dyn           | 36.7            | -12.9              | -26.4       | 31.4        | 44.5        |
+ | Fuyu              | -               | -                  | -           | -           | -           |
+ | Kosmos-2          | -               | -                  | -           | -           | -           |
+ | CogVLM            | -               | -                  | -           | -           | -           |
+ | OpenFlamingo      | -               | -                  | -           | -           | -           |
+
+ The best result in the MantisScore series is in bold and the best among the baselines is underlined.
+ "-" means the MLLM's answer is meaningless or in the wrong format.
+
+ ## Usage
+ ### Installation
+ ```bash
+ pip install git+https://github.com/TIGER-AI-Lab/MantisScore.git
+ ```
+
+ ### Inference
+ ```python
+ import av
+ import numpy as np
+ import torch
+ from typing import List
+ from PIL import Image
+ from transformers import AutoProcessor
+ # Idefics2ForSequenceClassification is assumed to be provided by the Mantis codebase
+ # installed above; adjust the import path if your installation differs.
+ from mantis.models.idefics2 import Idefics2ForSequenceClassification
+
+ def _read_video_pyav(
+     container,  # an av container opened from the video file
+     indices,    # indices of the frames to keep
+ ):
+     frames = []
+     container.seek(0)
+     start_index = indices[0]
+     end_index = indices[-1]
+     for i, frame in enumerate(container.decode(video=0)):
+         if i > end_index:
+             break
+         if i >= start_index and i in indices:
+             frames.append(frame)
+     return np.stack([x.to_ndarray(format="rgb24") for x in frames])
+
+ MAX_NUM_FRAMES = 16
+ ROUND_DIGIT = 4  # decimal places kept in the printed scores (assumed to match the sample output below)
+ REGRESSION_QUERY_PROMPT = """
+ Suppose you are an expert in judging and evaluating the quality of AI-generated videos,
+ please watch the following frames of a given video and see the text prompt for generating the video,
+ then give scores from 5 different dimensions:
+ (1) visual quality: the quality of the video in terms of clearness, resolution, brightness, and color
+ (2) temporal consistency, both the consistency of objects or humans and the smoothness of motion or movements
+ (3) dynamic degree, the degree of dynamic changes
+ (4) text-to-video alignment, the alignment between the text prompt and the video content
+ (5) factual consistency, the consistency of the video content with the common-sense and factual knowledge
+
+ for each dimension, output a float number from 1.0 to 4.0,
+ the higher the number is, the better the video performs in that sub-score,
+ the lowest 1.0 means Bad, the highest 4.0 means Perfect/Real (the video is like a real video)
+ Here is an output example:
+ visual quality: 3.2
+ temporal consistency: 2.7
+ dynamic degree: 4.0
+ text-to-video alignment: 2.3
+ factual consistency: 1.8
+
+ For this video, the text prompt is "{text_prompt}",
+ all the frames of video are as follows:
+ """
+
+ # load the processor and the regression model of this repository
+ model_name = "TIGER-Lab/MantisScore-anno-only"
+ processor = AutoProcessor.from_pretrained(model_name)
+ model = Idefics2ForSequenceClassification.from_pretrained(model_name, torch_dtype=torch.bfloat16).eval()
+
+ video_path = "examples/video1.mp4"
+ video_prompt = "..."  # the text prompt that was used to generate the video
+
+ # uniformly sample at most MAX_NUM_FRAMES frames from the video
+ container = av.open(video_path)
+ total_frames = container.streams.video[0].frames
+ if total_frames > MAX_NUM_FRAMES:
+     indices = np.arange(0, total_frames, total_frames / MAX_NUM_FRAMES).astype(int)
+ else:
+     indices = np.arange(total_frames)
+
+ frames = [Image.fromarray(x) for x in _read_video_pyav(container, indices)]
+ eval_prompt = REGRESSION_QUERY_PROMPT.format(text_prompt=video_prompt)
+ # make sure there is one <image> placeholder per sampled frame
+ num_image_token = eval_prompt.count("<image>")
+ if num_image_token < len(frames):
+     eval_prompt += "<image> " * (len(frames) - num_image_token)
+
+ flatten_images = []
+ for x in [frames]:
+     if isinstance(x, list):
+         flatten_images.extend(x)
+     else:
+         flatten_images.append(x)
+ flatten_images = [Image.open(x) if isinstance(x, str) else x for x in flatten_images]
+ inputs = processor(text=eval_prompt, images=flatten_images, return_tensors="pt")
+ inputs = {k: v.to(model.device) for k, v in inputs.items()}
+
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ # the regression head outputs one logit per evaluation aspect
+ logits = outputs.logits
+ num_aspects = logits.shape[-1]
+
+ aspect_scores = []
+ for i in range(num_aspects):
+     aspect_scores.append(round(logits[0, i].item(), ROUND_DIGIT))
+ print(aspect_scores)
+
+ """
+ # model output on visual quality, temporal consistency, dynamic degree,
+ # text-to-video alignment, factual consistency, respectively
+ [2.2969, 2.4375, 2.8281, 2.5, 2.4688]
+ """
+ ```
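+
+ The five numbers follow the aspect order listed in the prompt above. As a small follow-up sketch, they can be paired with their aspect names; the list below simply repeats the order stated in the prompt and is not an API of the model:
+ ```python
+ # label the five regression outputs with their aspect names
+ ASPECTS = ["visual quality", "temporal consistency", "dynamic degree",
+            "text-to-video alignment", "factual consistency"]
+ named_scores = dict(zip(ASPECTS, aspect_scores))
+ print(named_scores)
+ # e.g. {'visual quality': 2.2969, 'temporal consistency': 2.4375, ...}
+ ```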
+
+ ### Training
+ See [MantisScore/training](https://github.com/TIGER-AI-Lab/MantisScore/training) for details.
+
+ ### Evaluation
+ See [MantisScore/benchmark](https://github.com/TIGER-AI-Lab/MantisScore/benchmark) for details.
+
+ ## Citation