---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen3-1.7B
pipeline_tag: text-generation
tags:
- Qwen
- Qwen3
- Int8
---

# Qwen3-1.7B-Int8

This version of Qwen3-1.7B has been converted to run on the Axera NPU using **w8a16** quantization.

This model has been optimized with the following LoRA:

Compatible with Pulsar2 version: 4.0-temp (not yet released)

## Conversion tool links

If you are interested in model conversion, you can try exporting the axmodel yourself from the original repo:
https://huggingface.co/Qwen/Qwen3-1.7B

[Pulsar2 Link, How to Convert LLM from Huggingface to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html)

[AXera NPU LLM Runtime](https://github.com/AXERA-TECH/ax-llm)

## Supported Platforms

- AX650
  - [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
  - [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)

|Chip|w8a16|w4a16|
|--|--|--|
|AX650|9.5 tokens/sec|TBD|

## How to use

Download all files from this repository to the device:

```
root@ax650:/mnt/qtang/llm-test/qwen3-1.7b# tree -L 1
.
|-- config.json
|-- main_ax650
|-- main_axcl_aarch64
|-- main_axcl_x86
|-- post_config.json
|-- qwen2.5_tokenizer
|-- qwen3-1.7b-ax650
|-- qwen3_tokenizer
|-- qwen3_tokenizer_uid.py
|-- run_qwen3_1.7b_int8_ctx_ax650.sh
|-- run_qwen3_1.7b_int8_ctx_axcl_aarch64.sh
`-- run_qwen3_1.7b_int8_ctx_axcl_x86.sh

3 directories, 9 files
root@ax650:/mnt/qtang/llm-test/qwen3-1.7b#
```
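
Here, `main_ax650` is the runtime binary for running directly on an AX650 host, while the `main_axcl_*` binaries drive the NPU through an M.2 accelerator card over AXCL, as the sections below suggest; the matching `run_*.sh` scripts invoke them with the model and tokenizer settings used in the examples that follow.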

#### Start the Tokenizer service

Install the required Python packages:

```
pip install transformers jinja2
```

Then start the service:

```
root@ax650:/mnt/qtang/llm-test/qwen3-1.7b# python3 qwen3_tokenizer_uid.py
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Server running at http://0.0.0.0:12345
```
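
The service wraps the Hugging Face tokenizer and chat template so the on-device runtime can fetch token ids over HTTP. As a rough illustration of what it computes per request, here is a minimal sketch, assuming local access to the `Qwen/Qwen3-1.7B` tokenizer files; the actual HTTP protocol is whatever `qwen3_tokenizer_uid.py` implements:

```python
# Hedged sketch of the tokenization step the service performs; this is NOT
# the service's HTTP API. Assumes the Qwen/Qwen3-1.7B tokenizer files are
# reachable (e.g. downloadable from Hugging Face or cached locally).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")

messages = [
    {"role": "system",
     "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "1+1=?"},
]

# Render the chat template and tokenize; the on-device runtime consumes
# token ids like these during prefill.
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
input_ids = tokenizer(text).input_ids
print(len(input_ids), input_ids[:8])
```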

#### Inference with an AX650 host, such as the M4N-Dock(爱芯派Pro) or the AX650N DEMO board

Open another terminal and run `run_qwen3_1.7b_int8_ctx_ax650.sh`:

```
root@ax650:/mnt/qtang/llm-test/qwen3-1.7b# ./run_qwen3_1.7b_int8_ctx_ax650.sh
[I][ Init][ 110]: LLM init start
[I][ Init][ 34]: connect http://127.0.0.1:12345 ok
[I][ Init][ 57]: uid: 7a057c11-c513-485f-84a1-1d28dcbeb89d
bos_id: -1, eos_id: 151645
3% | ██ | 1 / 31 [3.97s<123.16s, 0.25 count/s] tokenizer init ok
[I][ Init][ 26]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ | 31 / 31 [23.76s<23.76s, 1.30 count/s] init post axmodel ok,remain_cmm(8740 MB)
[I][ Init][ 188]: max_token_len : 2559
[I][ Init][ 193]: kv_cache_size : 1024, kv_cache_num: 2559
[I][ Init][ 201]: prefill_token_num : 128
[I][ Init][ 205]: grp: 1, prefill_max_token_num : 1
[I][ Init][ 205]: grp: 2, prefill_max_token_num : 512
[I][ Init][ 205]: grp: 3, prefill_max_token_num : 1024
[I][ Init][ 205]: grp: 4, prefill_max_token_num : 1536
[I][ Init][ 205]: grp: 5, prefill_max_token_num : 2048
[I][ Init][ 209]: prefill_max_token_num : 2048
[I][ load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": false,
    "enable_top_k_sampling": true,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 1,
    "top_p": 0.8
}

[I][ Init][ 218]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
[I][ GenerateKVCachePrefill][ 270]: input token num : 21, prefill_split_num : 1 prefill_grpid : 2
[I][ GenerateKVCachePrefill][ 307]: input_num_token:21
[I][ main][ 230]: precompute_len: 21
[I][ main][ 231]: system_prompt: You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
prompt >> 1+1=?
[I][ SetKVCache][ 530]: prefill_grpid:2 kv_cache_num:512 precompute_len:21 input_num_token:16
[I][ SetKVCache][ 533]: current prefill_max_token_num:1920
[I][ Run][ 659]: input token num : 16, prefill_split_num : 1
[I][ Run][ 685]: input_num_token:16
[I][ Run][ 808]: ttft: 678.72 ms
<think>

</think>

1 + 1 = 2.

[N][ Run][ 922]: hit eos,avg 9.16 token/s

[I][ GetKVCache][ 499]: precompute_len:49, remaining:1999
prompt >> who are you?
[I][ SetKVCache][ 530]: prefill_grpid:2 kv_cache_num:512 precompute_len:49 input_num_token:16
[I][ SetKVCache][ 533]: current prefill_max_token_num:1920
[I][ Run][ 659]: input token num : 16, prefill_split_num : 1
[I][ Run][ 685]: input_num_token:16
[I][ Run][ 808]: ttft: 677.87 ms
<think>

</think>

I am Qwen, a large language model developed by Alibaba Cloud. I can answer questions,
help with tasks, and provide information on various topics. I am designed to be helpful and useful to users.

[N][ Run][ 922]: hit eos,avg 9.13 token/s

[I][ GetKVCache][ 499]: precompute_len:110, remaining:1938
prompt >> q
```
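
The `load config` block above shows the runtime's sampling settings. With `enable_top_k_sampling` true and `top_k` at 1, decoding is effectively greedy. A minimal sketch for changing these values, assuming (as the file listing suggests) that they live in `post_config.json` and that the runtime honors the keys exactly as printed:

```python
# Hedged sketch: tweak sampling settings before launching the runtime.
# Assumption: post_config.json is the file behind the "load config" block,
# and the runtime reads these keys as printed in its log output.
import json

with open("post_config.json") as f:
    cfg = json.load(f)

# top_k=1 with top-k sampling enabled picks the single most likely token;
# enabling temperature sampling gives more varied output.
cfg["enable_temperature"] = True
cfg["temperature"] = 0.7

with open("post_config.json", "w") as f:
    json.dump(cfg, f, indent=4)
```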

#### Inference with an M.2 Accelerator card

[What is the M.2 Accelerator card?](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html) This demo runs on a Raspberry Pi 5.

```
(base) axera@raspberrypi:~/samples/qwen3-1.7b $ ./run_qwen3_1.7b_int8_ctx_axcl_aarch64.sh
[I][ Init][ 136]: LLM init start
[I][ Init][ 34]: connect http://127.0.0.1:12345 ok
[I][ Init][ 57]: uid: ea509ef6-ab6c-49b0-9dcf-931db2ce1bf7
bos_id: -1, eos_id: 151645
3% | ██ | 1 / 31 [0.98s<30.47s, 1.02 count/s] tokenizer init ok
[I][ Init][ 45]: LLaMaEmbedSelector use mmap
6% | ███ | 2 / 31 [0.98s<15.24s, 2.03 count/s] embed_selector init ok
[I][ run][ 30]: AXCLWorker start with devid 0
100% | ████████████████████████████████ | 31 / 31 [49.40s<49.40s, 0.63 count/s] init post axmodel ok,remain_cmm(3788 MB)
[I][ Init][ 237]: max_token_len : 2559
[I][ Init][ 240]: kv_cache_size : 1024, kv_cache_num: 2559
[I][ Init][ 248]: prefill_token_num : 128
[I][ Init][ 252]: grp: 1, prefill_max_token_num : 1
[I][ Init][ 252]: grp: 2, prefill_max_token_num : 512
[I][ Init][ 252]: grp: 3, prefill_max_token_num : 1024
[I][ Init][ 252]: grp: 4, prefill_max_token_num : 1536
[I][ Init][ 252]: grp: 5, prefill_max_token_num : 2048
[I][ Init][ 256]: prefill_max_token_num : 2048
________________________
| ID| remain cmm(MB)|
========================
| 0| 3788|
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
[I][ load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": false,
    "enable_top_k_sampling": true,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 1,
    "top_p": 0.8
}

[I][ Init][ 279]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
[I][ GenerateKVCachePrefill][ 335]: input token num : 21, prefill_split_num : 1 prefill_grpid : 2
[I][ GenerateKVCachePrefill][ 372]: input_num_token:21
[I][ main][ 236]: precompute_len: 21
[I][ main][ 237]: system_prompt: You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
prompt >> 1+2=?
[I][ SetKVCache][ 628]: prefill_grpid:2 kv_cache_num:512 precompute_len:21 input_num_token:16
[I][ SetKVCache][ 631]: current prefill_max_token_num:1920
[I][ Run][ 869]: input token num : 16, prefill_split_num : 1
[I][ Run][ 901]: input_num_token:16
[I][ Run][1030]: ttft: 796.97 ms
<think>

</think>

1 + 2 = 3.

[N][ Run][1182]: hit eos,avg 7.43 token/s

[I][ GetKVCache][ 597]: precompute_len:49, remaining:1999
prompt >> who are you?
[I][ SetKVCache][ 628]: prefill_grpid:2 kv_cache_num:512 precompute_len:49 input_num_token:16
[I][ SetKVCache][ 631]: current prefill_max_token_num:1920
[I][ Run][ 869]: input token num : 16, prefill_split_num : 1
[I][ Run][ 901]: input_num_token:16
[I][ Run][1030]: ttft: 800.01 ms
<think>

</think>

I am Qwen, a large language model developed by Alibaba Cloud. I can help with various tasks,
such as answering questions, writing text, providing explanations, and more. If you have any questions or need assistance, feel free to ask!

[N][ Run][1182]: hit eos,avg 7.42 token/s

[I][ GetKVCache][ 597]: precompute_len:118, remaining:1930
prompt >> q
[I][ run][ 80]: AXCLWorker exit with devid 0
(base) axera@raspberrypi:~/samples/qwen3-1.7b $
(base) axera@raspberrypi:~ $ axcl-smi
+------------------------------------------------------------------------------------------------+
| AXCL-SMI V3.4.0_20250423020139 Driver V3.4.0_20250423020139 |
+-----------------------------------------+--------------+---------------------------------------+
| Card Name Firmware | Bus-Id | Memory-Usage |
| Fan Temp Pwr:Usage/Cap | CPU NPU | CMM-Usage |
|=========================================+==============+=======================================|
| 0 AX650N V3.4.0 | 0000:01:00.0 | 183 MiB / 945 MiB |
| -- 38C -- / -- | 0% 0% | 3251 MiB / 7040 MiB |
+-----------------------------------------+--------------+---------------------------------------+

+------------------------------------------------------------------------------------------------+
| Processes: |
| Card PID Process Name NPU Memory Usage |
|================================================================================================|
| 0 71266 /home/axera/samples/qwen3-1.7b/main_axcl_aarch64 2193524 KiB |
+------------------------------------------------------------------------------------------------+
(base) axera@raspberrypi:~ $
```
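
Per the `axcl-smi` output above, the running model occupies about 2.1 GiB of device CMM memory (2193524 KiB) on the accelerator card.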