---
license: apache-2.0
datasets:
- TIGER-Lab/VideoFeedback
language:
- en
metrics:
- accuracy/spcc
library_name: transformers
pipeline_tag: visual-question-answering
---

[📃Paper] | [🌐Website](https://tiger-ai-lab.github.io/MantisScore/) | [💻Github](https://github.com/TIGER-AI-Lab/MantisScore) | [🛢️Datasets](https://huggingface.co/datasets/TIGER-Lab/VideoFeedback) | [🤗Model](https://huggingface.co/TIGER-Lab/MantisScore) | [🤗Model-variant](https://huggingface.co/TIGER-Lab/MantisScore-anno-only) | [🤗Demo](https://huggingface.co/spaces/Mantis-VL/MantisScore)

![MantisScore](https://tiger-ai-lab.github.io/MantisScore/static/images/teaser.png)

## Introduction

- MantisScore-anno-only is a video quality evaluation model that takes [Mantis-8B-Idefics2](https://huggingface.co/TIGER-Lab/Mantis-8B-Idefics2) as its base model and is trained on [VideoFeedback](https://huggingface.co/datasets/TIGER-Lab/VideoFeedback), a large video evaluation dataset with multi-aspect human scores, with the real videos excluded (read more in the VideoFeedback dataset card).

- MantisScore reaches a Spearman correlation above 75 with human ratings on VideoFeedback-test, surpassing all MLLM-prompting methods and feature-based metrics.

- MantisScore also beats the best baselines on three other benchmarks, EvalCrafter, GenAI-Bench and VBench, showing high alignment with human evaluation.

## Evaluation Results

We test our video evaluation model MantisScore on VideoFeedback-test, EvalCrafter, GenAI-Bench and VBench.
For the first two benchmarks, the indicator is the Spearman correlation between the model's output and human ratings,
averaged over all evaluation aspects.
For GenAI-Bench and VBench, which include human preference data over two or more videos,
we use the model's output to predict preferences and take pairwise accuracy as the performance indicator.

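For clarity, here is a minimal sketch of how these two indicators can be computed once per-aspect model scores and human ratings have been collected. The function and variable names below are illustrative only and are not part of the released evaluation code; see the Github repo for the actual scripts.

```python
# Illustrative sketch (assumed data layout, not the released evaluation code):
# `model_scores` and `human_scores` map each evaluation aspect to a list of
# per-video scores on the same benchmark, in the same video order.
import numpy as np
from scipy.stats import spearmanr

def avg_spearman(model_scores: dict, human_scores: dict) -> float:
    """Spearman correlation per aspect, averaged over aspects
    (VideoFeedback-test / EvalCrafter style indicator)."""
    corrs = []
    for aspect in model_scores:
        rho, _ = spearmanr(model_scores[aspect], human_scores[aspect])
        corrs.append(rho)
    return float(np.mean(corrs))

def pairwise_accuracy(overall_scores: list, human_prefs: list) -> float:
    """Pairwise accuracy for preference data (GenAI-Bench / VBench style).

    `human_prefs` holds index pairs (i, j) meaning humans prefer video i over
    video j; a pair counts as correct when the model also scores video i higher.
    """
    correct = sum(overall_scores[i] > overall_scores[j] for i, j in human_prefs)
    return correct / len(human_prefs)
```
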
Moreover, we use [MantisScore](https://huggingface.co/TIGER-Lab/MantisScore), trained on the VideoFeedback dataset,
for the VideoFeedback-test set, while for the other three benchmarks we use the
[MantisScore-anno-only](https://huggingface.co/TIGER-Lab/MantisScore-anno-only) variant, trained on the VideoFeedback dataset
with the real videos excluded.

The evaluation results are shown below:

| metric | Final Sum Score | VideoFeedback-test | EvalCrafter | GenAI-Bench | VBench |
|:-----------------:|:---------------:|:------------------:|:-----------:|:-----------:|:----------:|
| MantisScore (reg) | **278.3** | 75.7 | **51.1** | **78.5** | **73.0** |
| MantisScore (gen) | 222.4 | **77.1** | 27.6 | 59.0 | 58.7 |
| Gemini-1.5-Pro | <u>158.8</u> | 22.1 | 22.9 | 60.9 | 52.9 |
| Gemini-1.5-Flash | 157.5 | 20.8 | 17.3 | <u>67.1</u> | 52.3 |
| GPT-4o | 155.4 | <u>23.1</u> | 28.7 | 52.0 | 51.7 |
| CLIP-sim | 126.8 | 8.9 | <u>36.2</u> | 34.2 | 47.4 |
| DINO-sim | 121.3 | 7.5 | 32.1 | 38.5 | 43.3 |
| SSIM-sim | 118.0 | 13.4 | 26.9 | 34.1 | 43.5 |
| CLIP-Score | 114.4 | -7.2 | 21.7 | 45.0 | 54.9 |
| LLaVA-1.5-7B | 108.3 | 8.5 | 10.5 | 49.9 | 39.4 |
| LLaVA-1.6-7B | 93.3 | -3.1 | 13.2 | 44.5 | 38.7 |
| X-CLIP-Score | 92.9 | -1.9 | 13.3 | 41.4 | 40.1 |
| PIQE | 78.3 | -10.1 | -1.2 | 34.5 | <u>55.1</u> |
| BRISQUE | 75.9 | -20.3 | 3.9 | 38.5 | 53.7 |
| Idefics2 | 73.0 | 6.5 | 0.3 | 34.6 | 31.7 |
| SSIM-dyn | 42.5 | -5.5 | -17.0 | 28.4 | 36.5 |
| MSE-dyn | 36.7 | -12.9 | -26.4 | 31.4 | 44.5 |
| Fuyu | - | - | - | - | - |
| Kosmos-2 | - | - | - | - | - |
| CogVLM | - | - | - | - | - |
| OpenFlamingo | - | - | - | - | - |

The best result in the MantisScore series is shown in bold, and the best among the baselines is underlined.
"-" means the MLLM's answer is meaningless or in the wrong format.

## Usage

### Installation

```bash
pip install git+https://github.com/TIGER-AI-Lab/MantisScore.git
```

### Inference

The snippet below samples frames from a video, builds the regression query prompt, and prints the five aspect scores.

```python
import av
import numpy as np
import torch
from PIL import Image
from transformers import AutoProcessor

# NOTE: the import path of the sequence-classification head is assumed from the
# Mantis codebase installed above; adjust it if your package layout differs.
from mantis.models.idefics2 import Idefics2ForSequenceClassification

def _read_video_pyav(container, indices):
    # decode only the frames whose indices were selected for sampling
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

MAX_NUM_FRAMES = 16
ROUND_DIGIT = 4
REGRESSION_QUERY_PROMPT = """
Suppose you are an expert in judging and evaluating the quality of AI-generated videos,
please watch the following frames of a given video and see the text prompt for generating the video,
then give scores from 5 different dimensions:
(1) visual quality: the quality of the video in terms of clearness, resolution, brightness, and color
(2) temporal consistency, both the consistency of objects or humans and the smoothness of motion or movements
(3) dynamic degree, the degree of dynamic changes
(4) text-to-video alignment, the alignment between the text prompt and the video content
(5) factual consistency, the consistency of the video content with the common-sense and factual knowledge

for each dimension, output a float number from 1.0 to 4.0,
the higher the number is, the better the video performs in that sub-score,
the lowest 1.0 means Bad, the highest 4.0 means Perfect/Real (the video is like a real video)
Here is an output example:
visual quality: 3.2
temporal consistency: 2.7
dynamic degree: 4.0
text-to-video alignment: 2.3
factual consistency: 1.8

For this video, the text prompt is "{text_prompt}",
all the frames of video are as follows:
"""

model_name = "TIGER-Lab/MantisScore-anno-only"
processor = AutoProcessor.from_pretrained(model_name)
model = Idefics2ForSequenceClassification.from_pretrained(model_name).eval()

video_path = "examples/video1.mp4"
video_prompt = "..."  # placeholder: the text prompt that was used to generate the video

# sample MAX_NUM_FRAMES frames uniformly from the video
container = av.open(video_path)
total_frames = container.streams.video[0].frames
if total_frames > MAX_NUM_FRAMES:
    indices = np.arange(0, total_frames, total_frames / MAX_NUM_FRAMES).astype(int)
else:
    indices = np.arange(total_frames)

frames = [Image.fromarray(x) for x in _read_video_pyav(container, indices)]
eval_prompt = REGRESSION_QUERY_PROMPT.format(text_prompt=video_prompt)
num_image_token = eval_prompt.count("<image>")
if num_image_token < len(frames):
    # append one <image> placeholder per sampled frame
    eval_prompt += "<image> " * (len(frames) - num_image_token)

flatten_images = []
for x in [frames]:
    if isinstance(x, list):
        flatten_images.extend(x)
    else:
        flatten_images.append(x)
flatten_images = [Image.open(x) if isinstance(x, str) else x for x in flatten_images]
inputs = processor(text=eval_prompt, images=flatten_images, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
num_aspects = logits.shape[-1]

aspect_scores = []
for i in range(num_aspects):
    aspect_scores.append(round(logits[0, i].item(), ROUND_DIGIT))
print(aspect_scores)

"""
# model output on visual quality, temporal consistency, dynamic degree,
# text-to-video alignment, factual consistency, respectively
[2.2969, 2.4375, 2.8281, 2.5, 2.4688]
"""
```
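If you want the printed list to be self-describing, you can pair it with the aspect names used in the prompt; this small helper is only an illustration and not part of the released code.

```python
# Illustrative only: pair the printed scores with the five aspect names
# (the order matches the comment in the snippet above).
ASPECT_NAMES = [
    "visual quality",
    "temporal consistency",
    "dynamic degree",
    "text-to-video alignment",
    "factual consistency",
]

named_scores = dict(zip(ASPECT_NAMES, aspect_scores))
print(named_scores)
# e.g. {'visual quality': 2.2969, 'temporal consistency': 2.4375, ...}
```
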

### Training
See [MantisScore/training](https://github.com/TIGER-AI-Lab/MantisScore/training) for details.

### Evaluation
See [MantisScore/benchmark](https://github.com/TIGER-AI-Lab/MantisScore/benchmark) for details.

## Citation