Update README.md
- **Model Developers:** Neural Magic

This is a code completion AI model obtained by fine-tuning the 2:4 sparse [Sparse-Llama-3.1-8B-2of4](https://huggingface.co/neuralmagic/Sparse-Llama-3.1-8B-2of4) on the [evol-codealpaca-v1](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1) dataset, followed by quantization.

On the [HumanEval](https://arxiv.org/abs/2107.03374) benchmark, it achieves a pass@1 of 50.6, compared to 48.5 for the fine-tuned dense model [Llama-3.1-8B-evolcodealpaca](https://huggingface.co/neuralmagic/Llama-3.1-8B-evolcodealpaca), demonstrating over **100% accuracy recovery**.
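The pass@1 scores above follow the unbiased pass@k estimator defined in the HumanEval paper. A minimal sketch (the function name and the toy per-problem results are illustrative, not from this evaluation):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    1 - C(n - c, k) / C(n, k), where n generated samples
    for a problem contain c correct ones."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per problem (n=1, k=1), pass@1 reduces to the
# fraction of problems whose single sample passes the unit tests.
per_problem = [1, 0, 1, 1, 0]  # 1 = sample passed, 0 = failed (toy data)
score = sum(pass_at_k(1, c, 1) for c in per_problem) / len(per_problem)
print(round(100 * score, 1))  # → 60.0
```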
### Model Optimizations

This model was obtained by quantizing the weights of [Sparse-Llama-3.1-8B-evolcodealpaca-2of4](https://huggingface.co/neuralmagic/Sparse-Llama-3.1-8B-evolcodealpaca-2of4) to the INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, cutting disk size and GPU memory requirements by approximately 75%. That saving comes on top of the 50% reduction in weights already achieved by the 2:4 pruning in [Sparse-Llama-3.1-8B-evolcodealpaca-2of4](https://huggingface.co/neuralmagic/Sparse-Llama-3.1-8B-evolcodealpaca-2of4).
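The compounding of the two optimizations can be checked with back-of-the-envelope arithmetic, assuming a nominal 8B parameters and ignoring quantization scales, metadata, and unquantized layers (so the numbers are rough, not measured checkpoint sizes):

```python
# Approximate weight-storage footprint at each optimization stage.
params = 8e9

dense_fp16_gb = params * 16 / 8 / 1e9        # 16 bits per parameter
sparse_fp16_gb = dense_fp16_gb * 0.5         # 2:4 pruning keeps 2 of every 4 weights
sparse_int4_gb = params * 0.5 * 4 / 8 / 1e9  # then 4 bits per remaining weight

print(dense_fp16_gb, sparse_fp16_gb, sparse_int4_gb)  # → 16.0 8.0 2.0
```

The INT4 step alone accounts for the ~75% reduction (8 GB to 2 GB), on top of the 50% from pruning.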
Only the weights of the linear operators within transformer blocks are quantized. Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps the INT4 and floating-point representations of the quantized weights.
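As an illustration of symmetric per-channel quantization, the sketch below quantizes a single output channel with one shared scale; this shows only the quantization scheme, not the GPTQ algorithm itself, which additionally minimizes layer-wise reconstruction error:

```python
def quantize_row_int4(row):
    """Symmetric quantization of one output channel: a single scale
    maps the row's floats onto integers in [-8, 7] (here the symmetric
    range [-7, 7]); dequantization is simply q * scale."""
    scale = max(abs(x) for x in row) / 7.0
    q = [max(-8, min(7, round(x / scale))) for x in row]
    return q, scale

row = [0.42, -1.19, 0.07, 0.95, -0.31, 1.19]  # toy weights for one channel
q, scale = quantize_row_int4(row)
dequant = [v * scale for v in q]
max_err = max(abs(a - b) for a, b in zip(row, dequant))
# Rounding error is bounded by half a quantization step per weight:
print(q, max_err <= scale / 2)
```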
The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
## Deployment with vLLM
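As a sketch, the checkpoint can be served through vLLM's OpenAI-compatible server; the model id below assumes the repository matches the name used in the evaluation table, and the prompt is illustrative:

```shell
# Serve the quantized checkpoint with vLLM's OpenAI-compatible server.
# The w4a16 (INT4 weights, 16-bit activations) format is read from the
# checkpoint's quantization config.
vllm serve neuralmagic/Sparse-Llama-3.1-8B-evolcodealpaca-2of4-quantized.w4a16

# Query the completions endpoint for code completion:
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "neuralmagic/Sparse-Llama-3.1-8B-evolcodealpaca-2of4-quantized.w4a16",
       "prompt": "def fibonacci(n):", "max_tokens": 64}'
```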

This model was evaluated on Neural Magic's fork of [EvalPlus](https://github.com...

<table>
  <tr>
    <td><strong>Metric</strong></td>
    <td style="text-align: center"><strong>Llama-3.1-8B-evolcodealpaca</strong></td>
    <td style="text-align: center"><strong>Sparse-Llama-3.1-8B-evolcodealpaca-2of4</strong></td>
    <td style="text-align: center"><strong>Sparse-Llama-3.1-8B-evolcodealpaca-2of4-quantized.w4a16</strong></td>
  </tr>
  <tr>
    <td>HumanEval pass@1</td>
    <td style="text-align: center">48.5</td>
    <td style="text-align: center">49.1</td>
    <td style="text-align: center">50.6</td>
  </tr>
  <tr>
    <td>HumanEval+ pass@1</td>
    <td style="text-align: center">44.2</td>
    <td style="text-align: center">46.3</td>
    <td style="text-align: center">48.0</td>
  </tr>
</table>
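The accuracy-recovery claim can be reproduced directly from the table, taking recovery as the quantized-sparse score divided by the dense fine-tuned baseline:

```python
dense = {"HumanEval": 48.5, "HumanEval+": 44.2}
quantized_sparse = {"HumanEval": 50.6, "HumanEval+": 48.0}

# Recovery above 100% means the compressed model outscores the
# dense fine-tuned baseline on that benchmark.
for bench, base in dense.items():
    recovery = 100 * quantized_sparse[bench] / base
    print(f"{bench}: {recovery:.1f}% recovery")
# → HumanEval: 104.3% recovery
# → HumanEval+: 108.6% recovery
```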