|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- TIGER-Lab/VideoFeedback |
|
language: |
|
- en |
|
metrics: |
|
- accuracy/spcc |
|
library_name: transformers |
|
pipeline_tag: visual-question-answering |
|
--- |
|
|
|
|
|
[📃Paper] | [🌐Website](https://tiger-ai-lab.github.io/MantisScore/) | [💻Github](https://github.com/TIGER-AI-Lab/MantisScore) | [🛢️Datasets](https://huggingface.co/datasets/TIGER-Lab/VideoFeedback) | [🤗Model](https://huggingface.co/TIGER-Lab/MantisScore) | [🤗Model-variant](https://huggingface.co/TIGER-Lab/MantisScore-anno-only) | [🤗Demo](https://huggingface.co/spaces/Mantis-VL/MantisScore) |
|
|
|
|
|
 |
|
|
|
## Introduction |
|
- MantisScore-anno-only is a video quality evaluation model that takes [Mantis-8B-Idefics2](https://huggingface.co/TIGER-Lab/Mantis-8B-Idefics2) as the base model

and is trained on [VideoFeedback](https://huggingface.co/datasets/TIGER-Lab/VideoFeedback),

a large video evaluation dataset with multi-aspect human scores, with the real videos excluded (see the VideoFeedback dataset card for details).
|
|
|
- MantisScore reaches a Spearman correlation above 75 with human ratings on VideoFeedback-test, surpassing all MLLM-prompting methods and feature-based metrics.
|
|
|
- MantisScore also beats the best baselines on three other benchmarks, EvalCrafter, GenAI-Bench, and VBench, showing high alignment with human evaluation.
|
|
|
## Evaluation Results |
|
|
|
We test our video evaluation model MantisScore on VideoFeedback-test, EvalCrafter, GenAI-Bench, and VBench.

For the first two benchmarks, we report the Spearman correlation between the model's output and the human ratings,

averaged over all evaluation aspects.

For GenAI-Bench and VBench, which contain human preference data over two or more videos,

we use the model's output to predict the preferred video and report pairwise accuracy as the performance indicator.
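
Concretely, the two indicators can be computed as follows. This is a minimal illustrative sketch with made-up scores and a hypothetical `pairwise_accuracy` helper, not the benchmark code itself:

```python
import numpy as np
from scipy.stats import spearmanr

# Spearman correlation between model scores and human ratings for one aspect
model_scores = np.array([2.5, 3.1, 1.8, 3.9])   # hypothetical per-video model outputs
human_scores = np.array([3.0, 3.0, 2.0, 4.0])   # hypothetical human ratings
rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman correlation: {rho:.3f}")

# Pairwise accuracy for preference benchmarks: a prediction is correct when
# the model scores the human-preferred video higher than the other one
def pairwise_accuracy(score_pairs, human_prefers_first):
    correct = [
        (a > b) == pref
        for (a, b), pref in zip(score_pairs, human_prefers_first)
    ]
    return sum(correct) / len(correct)

pairs = [(3.2, 2.1), (1.5, 2.8), (2.0, 2.6)]   # hypothetical (video A, video B) scores
prefers_first = [True, False, True]            # hypothetical human preferences
print(f"Pairwise accuracy: {pairwise_accuracy(pairs, prefers_first):.3f}")
```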
|
|
|
Moreover, for the VideoFeedback-test set we use [MantisScore](https://huggingface.co/TIGER-Lab/MantisScore), trained on the full VideoFeedback dataset,

while for the other three benchmarks we use the

[MantisScore-anno-only](https://huggingface.co/TIGER-Lab/MantisScore-anno-only) variant, trained on VideoFeedback

with the real videos excluded.
|
|
|
The evaluation results are shown below: |
|
|
|
|
|
| metric | Final Sum Score | VideoFeedback-test | EvalCrafter | GenAI-Bench | VBench | |
|
|:-----------------:|:---------------:|:--------------:|:-----------:|:-----------:|:----------:| |
|
| MantisScore (reg) | **278.3** | 75.7 | **51.1** | **78.5** | **73.0** | |
|
| MantisScore (gen) | 222.4 | **77.1** | 27.6 | 59.0 | 58.7 | |
|
| Gemini-1.5-Pro | <u>158.8</u> | 22.1 | 22.9 | 60.9 | 52.9 | |
|
| Gemini-1.5-Flash | 157.5 | 20.8 | 17.3 | <u>67.1</u> | 52.3 | |
|
| GPT-4o | 155.4 | <u>23.1</u> | 28.7 | 52.0 | 51.7 | |
|
| CLIP-sim | 126.8 | 8.9 | <u>36.2</u> | 34.2 | 47.4 | |
|
| DINO-sim | 121.3 | 7.5 | 32.1 | 38.5 | 43.3 | |
|
| SSIM-sim | 118.0 | 13.4 | 26.9 | 34.1 | 43.5 | |
|
| CLIP-Score | 114.4 | -7.2 | 21.7 | 45.0 | 54.9 | |
|
| LLaVA-1.5-7B | 108.3 | 8.5 | 10.5 | 49.9 | 39.4 | |
|
| LLaVA-1.6-7B | 93.3 | -3.1 | 13.2 | 44.5 | 38.7 | |
|
| X-CLIP-Score | 92.9 | -1.9 | 13.3 | 41.4 | 40.1 | |
|
| PIQE              | 78.3            | -10.1          | -1.2        | 34.5        | <u>55.1</u> |
|
| BRISQUE | 75.9 | -20.3 | 3.9 | 38.5 | 53.7 | |
|
| Idefics2 | 73.0 | 6.5 | 0.3 | 34.6 | 31.7 | |
|
| SSIM-dyn | 42.5 | -5.5 | -17.0 | 28.4 | 36.5 | |
|
| MSE-dyn           | 36.7            | -12.9          | -26.4       | 31.4        | 44.5       |
|
| Fuyu | - | - | - | - | - | |
|
| Kosmos-2 | - | - | - | - | - | |
|
| CogVLM | - | - | - | - | - | |
|
| OpenFlamingo | - | - | - | - | - | |
|
|
|
Final Sum Score is the sum of a method's scores on the four benchmarks (for example, for MantisScore (reg): 75.7 + 51.1 + 78.5 + 73.0 = 278.3).

The best result in the MantisScore series is in bold, and the best among the baselines is underlined.

"-" means the MLLM's answer is meaningless or in the wrong format.
|
|
|
## Usage |
|
### Installation |
|
```bash |
|
pip install git+https://github.com/TIGER-AI-Lab/MantisScore.git |
|
``` |
|
|
|
### Inference |
|
```python
import av
import numpy as np
import torch
from PIL import Image
from transformers import AutoProcessor
from models.idefics2 import Idefics2ForSequenceClassification


def _read_video_pyav(container, indices):
    # decode the video and keep only the frames at the sampled indices
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])


MAX_NUM_FRAMES = 16
ROUND_DIGIT = 3
REGRESSION_QUERY_PROMPT = """
Suppose you are an expert in judging and evaluating the quality of AI-generated videos,
please watch the following frames of a given video and see the text prompt for generating the video,
then give scores from 5 different dimensions:
(1) visual quality: the quality of the video in terms of clearness, resolution, brightness, and color
(2) temporal consistency, both the consistency of objects or humans and the smoothness of motion or movements
(3) dynamic degree, the degree of dynamic changes
(4) text-to-video alignment, the alignment between the text prompt and the video content
(5) factual consistency, the consistency of the video content with the common-sense and factual knowledge

for each dimension, output a float number from 1.0 to 4.0,
the higher the number is, the better the video performs in that sub-score,
the lowest 1.0 means Bad, the highest 4.0 means Perfect/Real (the video is like a real video)
Here is an output example:
visual quality: 3.2
temporal consistency: 2.7
dynamic degree: 4.0
text-to-video alignment: 2.3
factual consistency: 1.8

For this video, the text prompt is "{text_prompt}",
all the frames of video are as follows:
"""

model_name = "TIGER-Lab/MantisScore-anno-only"
video_path = "examples/video1.mp4"
video_prompt = "Near the Elephant Gate village, they approach the haunted house at night. Rajiv feels anxious, but Bhavesh encourages him. As they reach the house, a mysterious sound in the air adds to the suspense."

processor = AutoProcessor.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model = Idefics2ForSequenceClassification.from_pretrained(model_name, torch_dtype=torch.bfloat16).eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# sample up to MAX_NUM_FRAMES frames uniformly from the video
container = av.open(video_path)
total_frames = container.streams.video[0].frames
if total_frames > MAX_NUM_FRAMES:
    indices = np.arange(0, total_frames, total_frames / MAX_NUM_FRAMES).astype(int)
else:
    indices = np.arange(total_frames)

frames = [Image.fromarray(x) for x in _read_video_pyav(container, indices)]
eval_prompt = REGRESSION_QUERY_PROMPT.format(text_prompt=video_prompt)

# append one <image> placeholder per frame that is not already in the prompt
num_image_token = eval_prompt.count("<image>")
if num_image_token < len(frames):
    eval_prompt += "<image> " * (len(frames) - num_image_token)

# flatten the (possibly nested) list of frames and open any file paths as images
flatten_images = []
for x in [frames]:
    if isinstance(x, list):
        flatten_images.extend(x)
    else:
        flatten_images.append(x)
flatten_images = [Image.open(x) if isinstance(x, str) else x for x in flatten_images]

inputs = processor(text=eval_prompt, images=flatten_images, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)

# the regression head outputs one score per evaluation aspect
logits = outputs.logits
num_aspects = logits.shape[-1]

aspect_scores = []
for i in range(num_aspects):
    aspect_scores.append(round(logits[0, i].item(), ROUND_DIGIT))
print(aspect_scores)
"""
# model output on visual quality, temporal consistency, dynamic degree,
# text-to-video alignment, factual consistency, respectively
[2.453, 2.706, 2.468, 2.464, 2.572]
"""
```
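
The five scores follow the dimension order defined in the prompt. Assuming the snippet above has been run and `aspect_scores` is available, they can be labeled like this (a minimal sketch, not part of the library API):

```python
# pair each regression output with its dimension name (order follows REGRESSION_QUERY_PROMPT)
dimensions = [
    "visual quality",
    "temporal consistency",
    "dynamic degree",
    "text-to-video alignment",
    "factual consistency",
]
for name, score in zip(dimensions, aspect_scores):
    print(f"{name}: {score}")
```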
|
|
|
### Training |
|
See [MantisScore/training](https://github.com/TIGER-AI-Lab/MantisScore/training) for details.
|
|
|
### Evaluation |
|
See [MantisScore/benchmark](https://github.com/TIGER-AI-Lab/MantisScore/benchmark) for details.
|
|
|
## Citation |
|
|