MMLU-Pro evaluation score discrepancy
I noticed that in the performance charts this model family is compared to Llama 3B, 8B, and 70B, which are shown scoring 33.76%, 46.23%, and 70.3% on MMLU-Pro, but on the official MMLU-Pro leaderboard those models only scored 22%, 36.6%, and 65.9%. The scores for Qwen2.5 are also different.
https://i.imgur.com/5OwqSNY.png
https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro
You're right!
We wanted to ensure consistency, so we evaluated the models ourselves, using the hyperparameters set in each model's config on the HF Hub (running in bfloat16 on H200s with vLLM). The scores in our blog for the Llama and Qwen models are higher than those on the MMLU-Pro leaderboard, which means we are comparing against a higher baseline.
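
For reference, here is a minimal sketch of that kind of setup; the model id, prompt, and max token budget below are placeholders for illustration, not our exact harness:

```python
# Sketch: re-run a chat-style eval with vLLM in bfloat16, reusing the
# generation hyperparameters shipped with the HF checkpoint.
from transformers import GenerationConfig
from vllm import LLM, SamplingParams

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model id

# Pull generation_config.json so sampling matches the model's own defaults.
gen_cfg = GenerationConfig.from_pretrained(MODEL_ID)
sampling = SamplingParams(
    temperature=gen_cfg.temperature if gen_cfg.temperature is not None else 0.0,
    top_p=gen_cfg.top_p if gen_cfg.top_p is not None else 1.0,
    max_tokens=2048,  # placeholder budget
)

# Load the weights in bfloat16 (tensor parallelism etc. omitted here).
llm = LLM(model=MODEL_ID, dtype="bfloat16")

# llm.chat() applies the model's own chat template before generating,
# which is exactly the step that gets skipped when scores come out low.
messages = [[{
    "role": "user",
    "content": "Answer with a single letter (A-J).\n<question + options here>",
}]]
outputs = llm.chat(messages, sampling)
print(outputs[0].outputs[0].text)
```
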
Of course, some variance is expected given the inherent nondeterminism of LLM outputs, but I believe the large MMLU-Pro difference here most likely comes from other platforms not using the correct Llama / Qwen chat template, or from not checking the solutions correctly (i.e., using heuristics to parse LLM outputs, which can sometimes fail).
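
To illustrate the answer-checking point, here is a hedged sketch of the kind of extraction logic involved; the function name and regexes are illustrative, not any platform's actual scorer:

```python
import re


def extract_choice(response: str) -> str | None:
    """Pull the final multiple-choice letter (A-J) out of a model response.

    MMLU-Pro has up to 10 options, so the pattern covers A-J. A strict
    "answer is (X)" match alone misses valid answers phrased differently,
    which is one way a scorer can undercount correct responses.
    """
    # Preferred: the explicit "answer is (X)" phrasing the prompt asks for.
    m = re.search(r"answer is \(?([A-J])\)?", response, flags=re.IGNORECASE)
    if m:
        return m.group(1).upper()
    # Fallback: the last standalone letter A-J anywhere in the response.
    letters = re.findall(r"\b([A-J])\b", response)
    return letters[-1] if letters else None


# A response like "So the correct option is C." is missed by the strict
# pattern alone but recovered by the fallback.
print(extract_choice("The answer is (B)."))           # B
print(extract_choice("So the correct option is C."))  # C
```
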