lixuejing committed
Commit: d1c7622 · Parent: 64fd93c
update

src/about.py CHANGED (+22 -7)
@@ -45,14 +45,22 @@
"""
# Which evaluations are you running? how can people reproduce what you have?
LLM_BENCHMARKS_TEXT = f"""
-
+
+# The Goal of FlagEval-VLM Leaderboard
+
感谢您积极的参与评测,在未来,我们会持续推动 FlagEval-VLM Leaderboard 更加完善,维护生态开放,欢迎开发者参与评测方法、工具和数据集的探讨,让我们一起建设更加科学和公正的榜单。
+
Thank you for your active participation in the evaluation. Going forward, we will continue to improve the FlagEval-VLM Leaderboard and keep its ecosystem open, and we welcome developers to join the discussion of evaluation methodology, tools, and datasets, so that together we can build a more scientific and fair leaderboard.
-
+
+# Context
FlagEval-VLM Leaderboard是视觉大语言排行榜,我们希望能够推动更加开放的生态,让视觉大语言模型开发者参与进来,为推动视觉大语言模型进步做出相应的贡献。 为了实现公平性的目标,所有模型都在 FlagEval 平台上使用标准化 GPU 和统一环境进行评估,以确保公平性。
+
The FlagEval-VLM Leaderboard is a leaderboard for visual large language models (VLMs). We hope to promote a more open ecosystem in which VLM developers can participate and contribute to the advancement of visual large language models. To achieve fairness, all models are evaluated on the FlagEval platform using standardized GPUs and a unified environment.
-
+
+## How it works
+
We evaluate models on 9 key benchmarks using the Eleuther AI Language Model Evaluation Harness, a unified framework to test generative language models on a large number of different evaluation tasks.
+
- Chart_QA
- CMMU - a benchmark for Chinese multi-modal multi-type question understanding and reasoning
- CMMMU - a new Chinese Massive Multi-discipline Multimodal Understanding benchmark designed to evaluate LMMs on tasks demanding college-level subject knowledge and deliberate reasoning in a Chinese context.
@@ -64,24 +72,31 @@
- CII-Bench - a new benchmark measuring the higher-order perceptual, reasoning and comprehension abilities of MLLMs when presented with complex Chinese implication images.
- Blink
For all these evaluations, a higher score is better. We chose these benchmarks as they test a variety of reasoning and general knowledge across a wide variety of fields in 0-shot and few-shot settings.
-
+
+## Details and logs
You can find:
- detailed numerical results in the results Hugging Face dataset: https://huggingface.co/datasets/open-cn-llm-leaderboard/results
- community queries and running status in the requests Hugging Face dataset: https://huggingface.co/datasets/open-cn-llm-leaderboard/requests
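Both repos can be pulled locally for offline analysis. Here is a minimal, hypothetical sketch using `huggingface_hub`; the repo IDs come from the two links above, and the assumption that both repos are publicly downloadable is illustrative, not something this commit states:

```python
# Hypothetical sketch: download both leaderboard repos for local inspection.
# repo_type="dataset" is required because these are dataset (not model) repos.
from huggingface_hub import snapshot_download

results_dir = snapshot_download(
    repo_id="open-cn-llm-leaderboard/results",
    repo_type="dataset",
)
requests_dir = snapshot_download(
    repo_id="open-cn-llm-leaderboard/requests",
    repo_type="dataset",
)
print(results_dir, requests_dir)  # local paths to the downloaded snapshots
```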
-
+
+## Reproducibility
+
To reproduce our results, here is the command you can run, using this version of the Eleuther AI Harness: `python main.py --model=hf-causal-experimental --model_args="pretrained=<your_model>,use_accelerate=True,revision=<your_model_revision>" --tasks=<task_list> --num_fewshot=<n_few_shot> --batch_size=1 --output_path=<output_path>`
The total batch size we get for models that fit on one A800 node is 8 (8 GPUs * 1). If you don't use parallelism, adapt your batch size to fit. You can expect results to vary slightly for different batch sizes because of padding.
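For concreteness, a hypothetical way to drive that command from Python; the model name, revision, task list, and output path below are placeholders for illustration, not values taken from this repository:

```python
# Hypothetical reproduction sketch: fill in your own model and tasks.
# The flags mirror the Eleuther AI Harness command quoted above.
import subprocess

model = "org/your-model"      # placeholder, not a real checkpoint
revision = "main"             # placeholder revision
tasks = "task1,task2"         # placeholder task list

cmd = [
    "python", "main.py",
    "--model=hf-causal-experimental",
    f"--model_args=pretrained={model},use_accelerate=True,revision={revision}",
    f"--tasks={tasks}",
    "--num_fewshot=0",        # placeholder few-shot count
    "--batch_size=1",
    "--output_path=results/your-model",
]
subprocess.run(cmd, check=True)  # raises if the harness exits non-zero
```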
Side note on the baseline scores:
- for log-likelihood evaluation, we select the random baseline
- for GSM8K, we select the score obtained in the paper after finetuning a 6B model on the full GSM8K training set for 50 epochs
-
+
+## Icons
+
- 🟢 : pretrained model: new, base models, trained on a given corpus
- 🔶 : fine-tuned model: pretrained models fine-tuned on more data, e.g. domain-specific datasets
- 💬 : chat model (RLHF, DPO, IFT, ...): chat-like fine-tunes, using IFT (datasets of task instructions), RLHF, or DPO (slightly changing the model loss with an added policy), etc.
- 🤝 : base merges and moerges: models which have been merged or fused without additional fine-tuning. If there is no icon, we have not uploaded the model's information yet; feel free to open an issue with the model information!
"Flagged" indicates that this model has been flagged by the community, and should probably be ignored! Clicking the link will redirect you to the discussion about the model.
-
+
+## Useful links
+
- Community resources
"""
|