Commit 69eb5f4 (verified) by alexmarques · Parent: 59e9c09

Update README.md

Files changed (1): README.md (+123 -5)
README.md CHANGED
@@ -33,7 +33,8 @@ base_model: meta-llama/Meta-Llama-3.1-70B-Instruct
33
  - **Model Developers:** Neural Magic
34
 
35
  Quantized version of [Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct).
36
- It achieves scores within 1.2% of the scores of the unquantized model for MMLU, ARC-Challenge, GSM-8k, Hellaswag, Winogrande and TruthfulQA.
 
37
 
38
  ### Model Optimizations
39
 
@@ -132,15 +133,20 @@ model.save_pretrained("Meta-Llama-3.1-70B-Instruct-quantized.w8a8")
132
 
133
  ## Evaluation
134
 
135
- The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande and TruthfulQA.
136
- Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
137
- This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-70B-Instruct-evals).
138
 
139
  **Note:** Results have been updated after Meta modified the chat template.
140
 
141
  ### Accuracy
142
 
143
- #### Open LLM Leaderboard evaluation scores
144
  <table>
145
  <tr>
146
  <td><strong>Benchmark</strong>
@@ -152,6 +158,20 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
152
  <td><strong>Recovery</strong>
153
  </td>
154
  </tr>
155
  <tr>
156
  <td>MMLU (5-shot)
157
  </td>
@@ -232,6 +252,104 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
232
  <td><strong>99.9%</strong>
233
  </td>
234
  </tr>
235
  </table>
236
 
237
  ### Reproduction
 
33
  - **Model Developers:** Neural Magic
34
 
35
  Quantized version of [Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct).
36
+ It was evaluated on several tasks to assess its quality in comparison to the unquantized model, including multiple-choice, math reasoning, and open-ended text generation.
37
+ Meta-Llama-3.1-70B-Instruct-quantized.w8a8 achieves 98.8% recovery for the Arena-Hard evaluation, 99.9% for OpenLLM v1 (using Meta's prompting when available), 97.3% for OpenLLM v2, 98.7% for HumanEval pass@1, and 98.9% for HumanEval+ pass@1.
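As a quick check on the recovery figures quoted above, the sketch below recomputes recovery as the quantized model's score divided by the unquantized model's score, expressed as a percentage. This ratio definition is an assumption inferred from the accuracy table later in this card (it reproduces the reported values up to rounding), not a formula stated by the authors.

```python
# Minimal sketch: "recovery" assumed to be quantized score / unquantized score, in percent.
# The Arena-Hard scores below are taken from the accuracy table in this card.

def recovery(quantized_score: float, baseline_score: float) -> float:
    """Quantized score as a percentage of the unquantized baseline score."""
    return 100.0 * quantized_score / baseline_score

print(f"{recovery(56.3, 57.0):.1f}%")  # Arena-Hard: 56.3 vs 57.0 -> 98.8%
```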
38
 
39
  ### Model Optimizations
40
 
 
133
 
134
  ## Evaluation
135
 
136
+ This model was evaluated on the well-known Arena-Hard, OpenLLM v1, OpenLLM v2, HumanEval, and HumanEval+ benchmarks.
137
+ In all cases, model outputs were generated with the [vLLM](https://docs.vllm.ai/en/stable/) engine.
138
+ Arena-Hard evaluations were conducted using the [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) repository.
139
+ The model generated a single answer for each prompt from Arena-Hard, and each answer was judged twice by GPT-4.
140
+ We report below the scores obtained in each judgement and the average.
141
+ OpenLLM v1 and v2 evaluations were conducted using Neural Magic's fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct).
142
+ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-70B-Instruct-evals) and a few fixes to OpenLLM v2 tasks.
143
+ HumanEval and HumanEval+ evaluations were conducted using Neural Magic's fork of the [EvalPlus](https://github.com/neuralmagic/evalplus) repository.
144
+ Detailed model outputs are available as HuggingFace datasets for [Arena-Hard](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-arena-hard-evals), [OpenLLM v2](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-leaderboard-v2-evals), and [HumanEval](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-humaneval-evals).
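For orientation, here is a minimal sketch of how benchmark prompts can be run through vLLM to produce model outputs of the kind scored in these evaluations. It is illustrative only: in practice generation is driven by Arena-Hard-Auto, lm-evaluation-harness, or EvalPlus rather than called by hand, and the model id, tensor-parallel size, and sampling settings below are assumptions.

```python
# Illustrative sketch only; the actual evaluations are orchestrated by the harnesses above.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id, tensor_parallel_size=4)            # assumed GPU count for a 70B model
sampling = SamplingParams(temperature=0.0, max_tokens=1024)  # greedy decoding, assumed settings

# Apply the chat template to a benchmark-style prompt before generation.
messages = [{"role": "user", "content": "Answer with a single letter: ..."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)
```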
145
 
146
  **Note:** Results have been updated after Meta modified the chat template.
147
 
148
  ### Accuracy
149
 
 
150
  <table>
151
  <tr>
152
  <td><strong>Benchmark</strong>
 
158
  <td><strong>Recovery</strong>
159
  </td>
160
  </tr>
161
+ <tr>
162
+ <td><strong>Arena Hard</strong>
163
+ </td>
164
+ <td>57.0 (55.8 / 58.2)
165
+ </td>
166
+ <td>56.3 (56.0 / 56.6)
167
+ </td>
168
+ <td>98.8%
169
+ </td>
170
+ </tr>
171
+ <tr>
172
+ <td><strong>OpenLLM v1</strong>
173
+ </td>
174
+ </tr>
175
  <tr>
176
  <td>MMLU (5-shot)
177
  </td>
 
252
  <td><strong>99.9%</strong>
253
  </td>
254
  </tr>
255
+ <tr>
256
+ <td><strong>OpenLLM v2</strong>
257
+ </td>
258
+ </tr>
259
+ <tr>
260
+ <td>MMLU-Pro
261
+ </td>
262
+ <td>48.1
263
+ </td>
264
+ <td>47.1
265
+ </td>
266
+ <td>97.9%
267
+ </td>
268
+ </tr>
269
+ <tr>
270
+ <td>IFEval
271
+ </td>
272
+ <td>86.4
273
+ </td>
274
+ <td>86.6
275
+ </td>
276
+ <td>100.2%
277
+ </td>
278
+ </tr>
279
+ <tr>
280
+ <td>BBH
281
+ </td>
282
+ <td>55.8
283
+ </td>
284
+ <td>55.2
285
+ </td>
286
+ <td>98.9%
287
+ </td>
288
+ </tr>
289
+ <tr>
290
+ <td>Math lvl 5
291
+ </td>
292
+ <td>26.1
293
+ </td>
294
+ <td>23.9
295
+ </td>
296
+ <td>91.8%
297
+ </td>
298
+ </tr>
299
+ <tr>
300
+ <td>GPQA
301
+ </td>
302
+ <td>15.4
303
+ </td>
304
+ <td>13.6
305
+ </td>
306
+ <td>88.4%
307
+ </td>
308
+ </tr>
309
+ <tr>
310
+ <td>MuSR (5-shot)
311
+ </td>
312
+ <td>18.2
313
+ </td>
314
+ <td>16.8
315
+ </td>
316
+ <td>92.6%
317
+ </td>
318
+ </tr>
319
+ <tr>
320
+ <td><strong>Average</strong>
321
+ </td>
322
+ <td><strong>41.7</strong>
323
+ </td>
324
+ <td><strong>40.5</strong>
325
+ </td>
326
+ <td><strong>97.3%</strong>
327
+ </td>
328
+ </tr>
329
+ <tr>
330
+ <td><strong>Coding</strong>
331
+ </td>
332
+ </tr>
333
+ <tr>
334
+ <td>HumanEval pass@1
335
+ </td>
336
+ <td>79.7
337
+ </td>
338
+ <td>78.7
339
+ </td>
340
+ <td>98.7%
341
+ </td>
342
+ </tr>
343
+ <tr>
344
+ <td>HumanEval+ pass@1
345
+ </td>
346
+ <td>74.8
347
+ </td>
348
+ <td>74.0
349
+ </td>
350
+ <td>98.9%
351
+ </td>
352
+ </tr>
353
  </table>
354
 
355
  ### Reproduction