nm-research committed on
Commit
7598972
·
verified ·
1 Parent(s): cd395db

Create README.md

  1. README.md +164 -0
README.md ADDED
@@ -0,0 +1,164 @@
+ ---
+ license: apache-2.0
+ license_link: https://huggingface.co/Qwen/Qwen2.5-3B/blob/main/LICENSE
+ language:
+ - en
+ pipeline_tag: text-generation
+ base_model: Qwen/Qwen2.5-3B
+ tags:
+ - chat
+ - neuralmagic
+ - llmcompressor
+ ---
+
+ # Qwen2.5-3B-quantized.w4a16
+
+ ## Model Overview
+ - **Model Architecture:** Qwen2
+ - **Input:** Text
+ - **Output:** Text
+ - **Model Optimizations:**
+   - **Weight quantization:** INT4
+ - **Intended Use Cases:** Intended for commercial and research use in multiple languages. Similar to [Qwen2.5-3B](https://huggingface.co/Qwen/Qwen2.5-3B), this model is intended for assistant-like chat.
+ - **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws).
+ - **Release Date:** 12/17/2024
+ - **Version:** 1.0
+ - **Model Developers:** Neural Magic
+
+ Quantized version of [Qwen2.5-3B](https://huggingface.co/Qwen/Qwen2.5-3B).
+ It achieves an average score of 62.18 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 63.59.
+
+ ### Model Optimizations
+
+ This model was obtained by quantizing the weights of [Qwen2.5-3B](https://huggingface.co/Qwen/Qwen2.5-3B) to the INT4 data type.
+ This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.
+
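The 75% figure follows directly from storing 4 bits instead of 16 per weight. The sketch below gives a slightly finer estimate under an assumption this card does not state: that each group of 64 quantized weights also stores one 16-bit scale. The function names are hypothetical and serve only to illustrate the arithmetic.

```python
def effective_bits_per_weight(weight_bits=4, scale_bits=16, group_size=64):
    """Bits per quantized weight once the per-group scale is spread over the group.

    scale_bits=16 is an assumption, not a value taken from the model card.
    """
    return weight_bits + scale_bits / group_size


def size_reduction(original_bits=16, **kwargs):
    """Fractional size reduction relative to the original bit width."""
    return 1.0 - effective_bits_per_weight(**kwargs) / original_bits


ideal = 1.0 - 4 / 16            # weights only, ignoring scales
with_scales = size_reduction()  # includes one assumed 16-bit scale per 64 weights
print(f"ideal: {ideal:.1%}, with scales: {with_scales:.1%}")
```

Because only the linear layers inside the transformer blocks are quantized, the reduction for the whole checkpoint would be somewhat lower than either number, since the remaining parameters stay at 16 bits.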
+ Only the weights of the linear operators within transformer blocks are quantized.
+ Symmetric per-group quantization is applied, in which a linear scale per group of 64 parameters maps between the INT4 and floating-point representations of the quantized weights.
+
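To make the per-group scheme concrete, here is a minimal pure-Python sketch of symmetric per-group quantization with a group size of 64. It is an illustration only, not the procedure used to produce this model: the scale rule (`max|w| / 7`), the clipping range `[-8, 7]`, and all function names are assumptions chosen for the example.

```python
import random

GROUP_SIZE = 64  # group size stated in the model card


def quantize_group(values):
    """Quantize one group to signed 4-bit codes with a single shared scale."""
    max_abs = max(abs(v) for v in values)
    # Assumed scale rule: map the largest magnitude in the group to code 7.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def quantize_row(row, group_size=GROUP_SIZE):
    """Split a weight row into groups and quantize each group independently."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of the group size")
    return [
        quantize_group(row[start:start + group_size])
        for start in range(0, len(row), group_size)
    ]


def dequantize_row(groups):
    """Map codes back to floating point: symmetric means no zero point."""
    return [code * scale for codes, scale in groups for code in codes]


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(128)]  # toy weights, two groups
groups = quantize_row(row)
restored = dequantize_row(groups)
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(f"groups: {len(groups)}, max abs error: {max_error:.6f}")
```

With this scale rule, the rounding error of every weight is at most half of its group's scale, which is why smaller groups (and therefore tighter scales) tend to preserve accuracy better at the cost of storing more scales.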
+ ## Deployment
+
+ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
+
+ ```python
+ from vllm import LLM, SamplingParams
+ from transformers import AutoTokenizer
+
+ model_id = "neuralmagic-ent/Qwen2.5-3B-quantized.w4a16"
+ number_gpus = 1
+ max_model_len = 8192
+
+ sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ prompt = "Give me a short introduction to large language model."
+
+ llm = LLM(model=model_id, tensor_parallel_size=number_gpus, max_model_len=max_model_len)
+
+ outputs = llm.generate(prompt, sampling_params)
+
+ generated_text = outputs[0].outputs[0].text
+ print(generated_text)
+ ```
+
+ vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
+
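As a rough illustration of the OpenAI-compatible route, the sketch below builds and sends a request to the `/v1/completions` endpoint using only the Python standard library. It assumes a server has already been started separately (for example with `vllm serve neuralmagic-ent/Qwen2.5-3B-quantized.w4a16`) and is listening on `localhost:8000`; the host, port, and helper names are assumptions for this example, not values from this card.

```python
import json
import urllib.request

MODEL_ID = "neuralmagic-ent/Qwen2.5-3B-quantized.w4a16"
BASE_URL = "http://localhost:8000/v1"  # assumed host and port of a running server


def build_completion_request(prompt, max_tokens=256, temperature=0.7, top_p=0.8):
    """Build a POST request for the OpenAI-style completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt):
    """Send the request and return the first generated text (needs a live server)."""
    request = build_completion_request(prompt)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

Calling `complete("Give me a short introduction to large language model.")` against a running server would return the generated continuation. The sampling values mirror those in the offline example above.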
+
+ ## Evaluation
+
+ The model was evaluated on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) leaderboard tasks (version 1) with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/383bbd54bc621086e05aa1b030d8d4d5635b25e6) (commit 383bbd54bc621086e05aa1b030d8d4d5635b25e6) and the [vLLM](https://docs.vllm.ai/en/stable/) engine, using the following command:
+ ```
+ lm_eval \
+   --model vllm \
+   --model_args pretrained="neuralmagic-ent/Qwen2.5-3B-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.9,add_bos_token=True,max_model_len=4096,enable_chunk_prefill=True,tensor_parallel_size=1 \
+   --tasks openllm \
+   --batch_size auto
+ ```
+
+ ### Accuracy
+
+ #### Open LLM Leaderboard evaluation scores
+ <table>
+   <tr>
+     <td><strong>Benchmark</strong></td>
+     <td><strong>Qwen2.5-3B</strong></td>
+     <td><strong>Qwen2.5-3B-quantized.w4a16 (this model)</strong></td>
+     <td><strong>Recovery</strong></td>
+   </tr>
+   <tr>
+     <td>MMLU (5-shot)</td>
+     <td>65.68</td>
+     <td>64.10</td>
+     <td>97.6%</td>
+   </tr>
+   <tr>
+     <td>ARC Challenge (25-shot)</td>
+     <td>53.58</td>
+     <td>51.19</td>
+     <td>95.6%</td>
+   </tr>
+   <tr>
+     <td>GSM-8K (5-shot, strict-match)</td>
+     <td>68.23</td>
+     <td>67.85</td>
+     <td>99.4%</td>
+   </tr>
+   <tr>
+     <td>Hellaswag (10-shot)</td>
+     <td>74.46</td>
+     <td>73.36</td>
+     <td>98.5%</td>
+   </tr>
+   <tr>
+     <td>Winogrande (5-shot)</td>
+     <td>70.64</td>
+     <td>70.32</td>
+     <td>99.6%</td>
+   </tr>
+   <tr>
+     <td>TruthfulQA (0-shot, mc2)</td>
+     <td>48.93</td>
+     <td>46.22</td>
+     <td>94.2%</td>
+   </tr>
+   <tr>
+     <td><strong>Average</strong></td>
+     <td><strong>63.59</strong></td>
+     <td><strong>62.18</strong></td>
+     <td><strong>97.8%</strong></td>
+   </tr>
+ </table>
+
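The card does not define the Recovery column. A natural reading, assumed here, is the quantized score expressed as a percentage of the unquantized score. The sketch below applies that reading to the rounded scores from the table; because the inputs are already rounded to two decimals, results can differ from the listed values by a few tenths of a point. The helper name `recovery` is hypothetical.

```python
def recovery(baseline, quantized):
    """Quantized score as a percentage of the baseline score (assumed definition)."""
    return 100.0 * quantized / baseline


# (baseline, quantized) scores copied from the table above.
scores = {
    "MMLU (5-shot)": (65.68, 64.10),
    "ARC Challenge (25-shot)": (53.58, 51.19),
    "GSM-8K (5-shot, strict-match)": (68.23, 67.85),
    "Hellaswag (10-shot)": (74.46, 73.36),
    "Winogrande (5-shot)": (70.64, 70.32),
    "TruthfulQA (0-shot, mc2)": (48.93, 46.22),
}

baseline_avg = sum(b for b, _ in scores.values()) / len(scores)
quantized_avg = sum(q for _, q in scores.values()) / len(scores)

for name, (base, quant) in scores.items():
    print(f"{name}: {recovery(base, quant):.1f}%")
print(f"Average: {baseline_avg:.2f} -> {quantized_avg:.2f} "
      f"({recovery(baseline_avg, quantized_avg):.1f}%)")
```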