lixuejing committed
Commit 5c4d8e2 · 1 Parent(s): b7e8d5e
Files changed (2)
  1. app.py +2 -2
  2. src/about.py +19 -19
app.py CHANGED
@@ -446,14 +446,14 @@ with demo:
     base_model_name_textbox = gr.Textbox(label="Base model (for delta or adapter weights)")
 
     with gr.Row():
-        gr.Markdown("## ✉️✨ Submit your API infos here!", elem_classes="markdown-text")
+        gr.Markdown("## ✉️✨ Submit your API infos here!(API only)", elem_classes="markdown-text")
     with gr.Row():
         model_url_textbox = gr.Textbox(label="Model online api url")
         model_api_key = gr.Textbox(label="Model online api key")
         model_api_name_textbox = gr.Textbox(label="Online api model name")
 
     with gr.Row():
-        gr.Markdown("## ✉️✨ Submit your inference infos here!", elem_classes="markdown-text")
+        gr.Markdown("## ✉️✨ Submit your inference infos here!(inference only)", elem_classes="markdown-text")
     with gr.Row():
         runsh = gr.File(label="upload run.sh file", file_types=[".sh"])
         adapter = gr.File(label="upload model_adapter.py file", file_types=[".py"])
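
Note on the change above: the commit splits the submission form into an API path (URL, key, model name) and an inference path (run.sh plus model_adapter.py upload). Below is a minimal, self-contained sketch of the same pattern using only standard Gradio components; the handler, variable names, and status box are illustrative placeholders, not the Space's actual wiring.

```python
# Illustrative sketch only: mirrors the two submission paths from the diff above,
# with a placeholder handler instead of the leaderboard's real submission logic.
import gradio as gr

def handle_submission(api_url, api_key, api_model, run_sh, adapter_py):
    # Decide which path the user filled in; a real handler would validate and queue a job.
    if api_url and api_key and api_model:
        return f"API submission received for model '{api_model}'."
    if run_sh is not None and adapter_py is not None:
        return "Inference submission received (run.sh and model_adapter.py uploaded)."
    return "Please fill in the API fields or upload both inference files."

with gr.Blocks() as demo:
    gr.Markdown("## ✉️✨ Submit your API infos here!(API only)")
    with gr.Row():
        model_url_textbox = gr.Textbox(label="Model online api url")
        model_api_key = gr.Textbox(label="Model online api key")
        model_api_name_textbox = gr.Textbox(label="Online api model name")

    gr.Markdown("## ✉️✨ Submit your inference infos here!(inference only)")
    with gr.Row():
        runsh = gr.File(label="upload run.sh file", file_types=[".sh"])
        adapter = gr.File(label="upload model_adapter.py file", file_types=[".py"])

    status = gr.Textbox(label="Submission status", interactive=False)
    gr.Button("Submit").click(
        handle_submission,
        inputs=[model_url_textbox, model_api_key, model_api_name_textbox, runsh, adapter],
        outputs=status,
    )

if __name__ == "__main__":
    demo.launch()
```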
src/about.py CHANGED
@@ -34,14 +34,12 @@ TITLE = """<h1 align="center" id="space-title">FlagEval-VLM Leaderboard</h1>"""
 # What does your leaderboard evaluate?
 
 INTRODUCTION_TEXT = """
-欢迎使用FlagEval-VLM Leaderboard
+欢迎使用FlagEval-VLM Leaderboard
 FlagEval-VLM Leaderboard 旨在跟踪、排名和评估开放式视觉大语言模型(VLM)。本排行榜由FlagEval平台提供相应算力和运行环境。VLM构建了一种基于数据集的能力体系,依据所接入的开源数据集,我们总结出了数学,视觉、图表、通用、文字以及中文等六个能力维度,由此组成一个评测集合。
-如需对模型进行更全面的评测,可以登录 [FlagEval](https://flageval.baai.ac.cn/api/users/providers/hf)平台,体验更加完善的模型评测功能。
 
 Welcome to the FlagEval-VLM Leaderboard!
 The FlagEval-VLM Leaderboard is designed to track, rank and evaluate open Visual Large Language Models (VLMs). This leaderboard is powered by the FlagEval platform, which provides the appropriate arithmetic and runtime environment.
 VLM builds a dataset-based competency system. Based on the accessed open source datasets, we summarize six competency dimensions, including Mathematical, Visual, Graphical, Generic, Textual, and Chinese, to form a collection of assessments.
-
 """
 # Which evaluations are you running? how can people reproduce what you have?
 LLM_BENCHMARKS_TEXT = f"""
@@ -50,7 +48,7 @@ LLM_BENCHMARKS_TEXT = f"""
 
 感谢您积极的参与评测,在未来,我们会持续推动 FlagEval-VLM Leaderboard 更加完善,维护生态开放,欢迎开发者参与评测方法、工具和数据集的探讨,让我们一起建设更加科学和公正的榜单。
 
-Thank you for your active participation in the evaluation. In the future, we will continue to promote FlagEval-VLM Leaderboard to be more perfect and maintain the openness of the ecosystem, and we welcome developers to participate in the discussion of evaluation methodology, tools and datasets, so that we can build a more scientific and fair list together.
+Thanks for your active participation in the evaluation. We will keep improving the FlagEval-VLM Leaderboard and keep the ecosystem open; developers are welcome to join the discussion of evaluation methods, tools and datasets, so that together we can build a more scientific and fair leaderboard.
 
 # Context
 
@@ -60,20 +58,20 @@ FlagEval-VLM Leaderboard is a Visual Large Language Leaderboard, and we hope to
 
 ## How it works
 
-We evaluate models on 9 key benchmarks using the Eleuther AI Language Model Evaluation Harness , a unified framework to test generative language models on a large number of different evaluation tasks.
+We evaluate models on 9 key benchmarks using [FlagEvalMM](https://github.com/flageval-baai/FlagEvalMM), an open-source evaluation framework designed to comprehensively assess multimodal models. It provides a standardized way to evaluate models that work with multiple modalities (text, images, video) across various tasks and metrics.
 
-- Chart_QA
+- <a href="https://github.com/vis-nlp/ChartQA" target="_blank">ChartQA</a> - a large-scale benchmark covering 9.6K manually written questions and 23.1K questions generated from manually written chart summaries.
+- Blink - a benchmark containing 14 visual perception tasks that can be solved by humans “within a blink”.
 - CMMU- a benchmark for Chinese multi-modal multi-type question understanding and reasoning
 - CMMMU-a new Chinese Massive Multi-discipline Multimodal Understanding benchmark designed to evaluate LMMs on tasks demanding college-level subject knowledge and deliberate reasoning in a Chinese context.
 - MMMU -a massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI.
-- MMMU_PRO_STANDARD - a more robust multi-discipline multimodal understanding benchmark.
-- MMMU_PRO_VISION
-- OCRBENCH- a comprehensive evaluation benchmark designed to assess the OCR capabilities of Large Multimodal Models.
-- MATH_VISION- a meticulously curated collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions.
+- MMMU_Pro (standard & vision) - a more robust multi-discipline multimodal understanding benchmark.
+- OCRBench - a comprehensive evaluation benchmark designed to assess the OCR capabilities of Large Multimodal Models.
+- MathVision - a meticulously curated collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions.
 - CII-Bench-a new benchmark measuring the higher-order perceptual, reasoning and comprehension abilities of MLLMs when presented with complex Chinese implication images.
-- Blink
 
-For all these evaluations, a higher score is a better score. We chose these benchmarks as they test a variety of reasoning and general knowledge across a wide variety of fields in 0-shot and few-shot settings.
+For all these evaluations, a higher score is a better score.
+Accuracy is used as the evaluation metric, calculated primarily according to the methodology outlined in each benchmark's original paper.
 
 ## Details and logs
 You can find:
@@ -82,15 +80,16 @@ You can find:
 
 ## Reproducibility
 
-To reproduce our results, here is the commands you can run, using this version of the Eleuther AI Harness: python main.py --model=hf-causal-experimental --model_args="pretrained=<your_model>,use_accelerate=True,revision=<your_model_revision>" --tasks=<task_list> --num_fewshot=<n_few_shot> --batch_size=1 --output_path=<output_path>
-The total batch size we get for models which fit on one A800 node is 8 (8 GPUs * 1). If you don't use parallelism, adapt your batch size to fit. You can expect results to vary slightly for different batch sizes because of padding.
-
-Side note on the baseline scores:
-- for log-likelihood evaluation, we select the random baseline
-- for GSM8K, we select the score obtained in the paper after finetuning a 6B model on the full GSM8K training set for 50 epochs
+An example of evaluating llava with vLLM as the backend:
+flagevalmm --tasks tasks/mmmu/mmmu_val.py \
+    --exec model_zoo/vlm/api_model/model_adapter.py \
+    --model llava-hf/llava-onevision-qwen2-7b-ov-chat-hf \
+    --num-workers 8 \
+    --output-dir ./results/llava-onevision-qwen2-7b-ov-chat-hf \
+    --backend vllm \
+    --extra-args "--limit-mm-per-prompt image=10 --max-model-len 32768"
 
 ## Icons
-
 - 🟢 : pretrained model: new, base models, trained on a given corpora
 - 🔶 : fine-tuned on domain-specific datasets model: pretrained models finetuned on more data
 - 💬 : chat models (RLHF, DPO, IFT, ...) model: chat like fine-tunes, either using IFT (datasets of task instruction), RLHF or DPO (changing the model loss a bit with an added policy), etc
@@ -99,6 +98,7 @@ Side note on the baseline scores:
 
 ## Useful links
 - [Community resources](https://huggingface.co/spaces/BAAI/open_flageval_vlm_leaderboard/discussions)
+- [FlagEvalMM](https://github.com/flageval-baai/FlagEvalMM)
 
 """
 
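
The metric note added in the diff says accuracy is the headline score, computed per benchmark following each paper's methodology. As a generic illustration only (the leaderboard's exact per-benchmark scoring and any aggregation weights are not specified in this commit), plain accuracy over a list of predictions looks like this:

```python
# Generic accuracy sketch: fraction of predictions matching the reference answer.
# Benchmark-specific answer normalization (e.g. for OCR or math tasks) is omitted.
def accuracy(predictions: list[str], references: list[str]) -> float:
    assert len(predictions) == len(references), "one prediction per reference"
    correct = sum(p.strip().lower() == r.strip().lower() for p, r in zip(predictions, references))
    return correct / len(references) if references else 0.0

# Example: 2 of 3 answers match, so the score is ~0.667.
print(accuracy(["A", "B", "C"], ["A", "B", "D"]))
```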
 
 
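To make the Reproducibility example above easier to rerun, here is a small wrapper sketch; it is not part of this repo and only re-issues the exact flagevalmm command from the diff via subprocess, assuming FlagEvalMM is installed and the flagevalmm entry point is on PATH. Paths and the model name are copied verbatim from the example and may need adapting to your environment.

```python
# Hypothetical helper (not part of this repo): re-runs the flagevalmm command
# shown in the Reproducibility section. Assumes the flagevalmm CLI is installed.
import subprocess

cmd = [
    "flagevalmm",
    "--tasks", "tasks/mmmu/mmmu_val.py",
    "--exec", "model_zoo/vlm/api_model/model_adapter.py",
    "--model", "llava-hf/llava-onevision-qwen2-7b-ov-chat-hf",
    "--num-workers", "8",
    "--output-dir", "./results/llava-onevision-qwen2-7b-ov-chat-hf",
    "--backend", "vllm",
    "--extra-args", "--limit-mm-per-prompt image=10 --max-model-len 32768",
]
subprocess.run(cmd, check=True)  # check=True raises CalledProcessError on failure
```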