Commit 69eb5f4 (verified) by alexmarques · Parent: 59e9c09

Update README.md

Files changed (1): README.md (+123 -5)
README.md CHANGED
@@ -33,7 +33,8 @@ base_model: meta-llama/Meta-Llama-3.1-70B-Instruct
33
  - **Model Developers:** Neural Magic
34
 
35
  Quantized version of [Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct).
36
- It achieves scores within 1.2% of the scores of the unquantized model for MMLU, ARC-Challenge, GSM-8k, Hellaswag, Winogrande and TruthfulQA.
 
37
 
38
  ### Model Optimizations
39
 
@@ -132,15 +133,20 @@ model.save_pretrained("Meta-Llama-3.1-70B-Instruct-quantized.w8a8")
132
 
133
  ## Evaluation
134
 
135
- The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande and TruthfulQA.
136
- Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
137
- This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-70B-Instruct-evals).
138
 
139
  **Note:** Results have been updated after Meta modified the chat template.
140
 
141
  ### Accuracy
142
 
143
- #### Open LLM Leaderboard evaluation scores
144
  <table>
145
  <tr>
146
  <td><strong>Benchmark</strong>
@@ -152,6 +158,20 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
152
  <td><strong>Recovery</strong>
153
  </td>
154
  </tr>
155
  <tr>
156
  <td>MMLU (5-shot)
157
  </td>
@@ -232,6 +252,104 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
232
  <td><strong>99.9%</strong>
233
  </td>
234
  </tr>
235
  </table>
236
 
237
  ### Reproduction
 
33
  - **Model Developers:** Neural Magic
34
 
35
  Quantized version of [Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct).
36
+ It was evaluated on several tasks to assess its quality in comparison to the unquantized model, including multiple-choice, math reasoning, and open-ended text generation.
37
+ Meta-Llama-3.1-70B-Instruct-quantized.w8a8 achieves 98.8% recovery for the Arena-Hard evaluation, 99.9% for OpenLLM v1 (using Meta's prompting when available), 97.3% for OpenLLM v2, 98.7% for HumanEval pass@1, and 98.9% for HumanEval+ pass@1.
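As a quick check on the recovery figures quoted above, the sketch below recomputes recovery as the quantized model's score divided by the unquantized model's score, expressed as a percentage. This ratio definition is an assumption inferred from the accuracy table later in this card (it reproduces the reported values up to rounding), not a formula stated by the authors.

```python
# Minimal sketch: "recovery" assumed to be quantized score / unquantized score, in percent.
# The Arena-Hard scores below are taken from the accuracy table in this card.

def recovery(quantized_score: float, baseline_score: float) -> float:
    """Quantized score as a percentage of the unquantized baseline score."""
    return 100.0 * quantized_score / baseline_score

print(f"{recovery(56.3, 57.0):.1f}%")  # Arena-Hard: 56.3 vs 57.0 -> 98.8%
```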
38
 
39
  ### Model Optimizations
40
 
 
133
 
134
  ## Evaluation
135
 
136
+ This model was evaluated on the well-known Arena-Hard, OpenLLM v1, OpenLLM v2, HumanEval, and HumanEval+ benchmarks.
137
+ In all cases, model outputs were generated with the [vLLM](https://docs.vllm.ai/en/stable/) engine.
138
+ Arena-Hard evaluations were conducted using the [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) repository.
139
+ The model generated a single answer for each prompt from Arena-Hard, and each answer was judged twice by GPT-4.
140
+ We report below the scores obtained in each judgement and the average.
141
+ OpenLLM v1 and v2 evaluations were conducted using Neural Magic's fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct).
142
+ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-70B-Instruct-evals) and a few fixes to OpenLLM v2 tasks.
143
+ HumanEval and HumanEval+ evaluations were conducted using Neural Magic's fork of the [EvalPlus](https://github.com/neuralmagic/evalplus) repository.
144
+ Detailed model outputs are available as HuggingFace datasets for [Arena-Hard](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-arena-hard-evals), [OpenLLM v2](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-leaderboard-v2-evals), and [HumanEval](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-humaneval-evals).
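For orientation, here is a minimal sketch of how benchmark prompts can be run through vLLM to produce model outputs of the kind scored in these evaluations. It is illustrative only: in practice generation is driven by Arena-Hard-Auto, lm-evaluation-harness, or EvalPlus rather than called by hand, and the model id, tensor-parallel size, and sampling settings below are assumptions.

```python
# Illustrative sketch only; the actual evaluations are orchestrated by the harnesses above.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id, tensor_parallel_size=4)            # assumed GPU count for a 70B model
sampling = SamplingParams(temperature=0.0, max_tokens=1024)  # greedy decoding, assumed settings

# Apply the chat template to a benchmark-style prompt before generation.
messages = [{"role": "user", "content": "Answer with a single letter: ..."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)
```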
145
 
146
  **Note:** Results have been updated after Meta modified the chat template.
147
 
148
  ### Accuracy
149
 
 
150
  <table>
151
  <tr>
152
  <td><strong>Benchmark</strong>
 
158
  <td><strong>Recovery</strong>
159
  </td>
160
  </tr>
161
+ <tr>
162
+ <td><strong>Arena Hard</strong>
163
+ </td>
164
+ <td>57.0 (55.8 / 58.2)
165
+ </td>
166
+ <td>56.3 (56.0 / 56.6)
167
+ </td>
168
+ <td>98.8%
169
+ </td>
170
+ </tr>
171
+ <tr>
172
+ <td><strong>OpenLLM v1</strong>
173
+ </td>
174
+ </tr>
175
  <tr>
176
  <td>MMLU (5-shot)
177
  </td>
 
252
  <td><strong>99.9%</strong>
253
  </td>
254
  </tr>
255
+ <tr>
256
+ <td><strong>OpenLLM v2</strong>
257
+ </td>
258
+ </tr>
259
+ <tr>
260
+ <td>MMLU-Pro
261
+ </td>
262
+ <td>48.1
263
+ </td>
264
+ <td>47.1
265
+ </td>
266
+ <td>97.9%
267
+ </td>
268
+ </tr>
269
+ <tr>
270
+ <td>IFEval
271
+ </td>
272
+ <td>86.4
273
+ </td>
274
+ <td>86.6
275
+ </td>
276
+ <td>100.2%
277
+ </td>
278
+ </tr>
279
+ <tr>
280
+ <td>BBH
281
+ </td>
282
+ <td>55.8
283
+ </td>
284
+ <td>55.2
285
+ </td>
286
+ <td>98.9%
287
+ </td>
288
+ </tr>
289
+ <tr>
290
+ <td>Math lvl 5
291
+ </td>
292
+ <td>26.1
293
+ </td>
294
+ <td>23.9
295
+ </td>
296
+ <td>91.8%
297
+ </td>
298
+ </tr>
299
+ <tr>
300
+ <td>GPQA
301
+ </td>
302
+ <td>15.4
303
+ </td>
304
+ <td>13.6
305
+ </td>
306
+ <td>88.4%
307
+ </td>
308
+ </tr>
309
+ <tr>
310
+ <td>MuSR (5-shot)
311
+ </td>
312
+ <td>18.2
313
+ </td>
314
+ <td>16.8
315
+ </td>
316
+ <td>92.6%
317
+ </td>
318
+ </tr>
319
+ <tr>
320
+ <td><strong>Average</strong>
321
+ </td>
322
+ <td><strong>41.7</strong>
323
+ </td>
324
+ <td><strong>40.5</strong>
325
+ </td>
326
+ <td><strong>97.3%</strong>
327
+ </td>
328
+ </tr>
329
+ <tr>
330
+ <td><strong>Coding</strong>
331
+ </td>
332
+ </tr>
333
+ <tr>
334
+ <td>HumanEval pass@1
335
+ </td>
336
+ <td>79.7
337
+ </td>
338
+ <td>78.7
339
+ </td>
340
+ <td>98.7%
341
+ </td>
342
+ </tr>
343
+ <tr>
344
+ <td>HumanEval+ pass@1
345
+ </td>
346
+ <td>74.8
347
+ </td>
348
+ <td>74.0
349
+ </td>
350
+ <td>98.9%
351
+ </td>
352
+ </tr>
353
  </table>
354
 
355
  ### Reproduction