---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen3-1.7B
pipeline_tag: text-generation
tags:
- Qwen
- Qwen3
- Int8
---
# Qwen3-1.7B-Int8
This version of Qwen3-1.7B has been converted to run on the Axera NPU using **w8a16** quantization.
Compatible with Pulsar2 version: 4.0-temp (not released yet).
## Conversion tool links
If you are interested in model conversion, you can try exporting the axmodel from the original repository:
https://huggingface.co/Qwen/Qwen3-1.7B
[Pulsar2 Link, How to Convert LLM from Huggingface to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html)
[AXera NPU LLM Runtime](https://github.com/AXERA-TECH/ax-llm)
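For reference, a w8a16 export with `pulsar2 llm_build` might look like the sketch below. The flag set follows the public Pulsar2 LLM-build documentation for earlier releases; the exact options accepted by the unreleased 4.0-temp version are an assumption here.
```
# Sketch only: flags per the Pulsar2 LLM-build docs for earlier releases;
# verify against the 4.0-temp documentation before use.
# --kv_cache_len matches the kv_cache_num reported in the runtime logs below.
pulsar2 llm_build \
  --input_path Qwen/Qwen3-1.7B \
  --output_path ./qwen3-1.7b-ax650 \
  --kv_cache_len 2559 \
  --hidden_state_type bf16 \
  --prefill_len 128 \
  --chip AX650
```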
## Supported Platforms
- AX650
- [M4N-Dock(η±θ―ζ΄ΎPro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
- [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)
|Chips|w8a16|w4a16|
|--|--|--|
|AX650| 9.5 tokens/sec|TBD|
## How to use
Download all files from this repository to the device.
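One way to fetch them is with the Hugging Face CLI. This is a sketch: it assumes the `huggingface_hub` package is installed and that this card's repo id is `AXERA-TECH/Qwen3-1.7B`; substitute the actual repo id.
```
# Sketch: download the whole repo to a local directory on the device.
# The repo id below is an assumption; replace it with this card's actual id.
pip install -U huggingface_hub
huggingface-cli download AXERA-TECH/Qwen3-1.7B --local-dir ./qwen3-1.7b
```
The layout on the device should look like this: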
```
root@ax650:/mnt/qtang/llm-test/qwen3-1.7b# tree -L 1
.
|-- config.json
|-- main_ax650
|-- main_axcl_aarch64
|-- main_axcl_x86
|-- post_config.json
|-- qwen2.5_tokenizer
|-- qwen3-1.7b-ax650
|-- qwen3_tokenizer
|-- qwen3_tokenizer_uid.py
|-- run_qwen3_1.7b_int8_ctx_ax650.sh
|-- run_qwen3_1.7b_int8_ctx_axcl_aarch64.sh
`-- run_qwen3_1.7b_int8_ctx_axcl_x86.sh
3 directories, 9 files
root@ax650:/mnt/qtang/llm-test/qwen3-1.7b#
```
#### Start the Tokenizer service
Install the required packages:
```
pip install transformers jinja2
```
```
root@ax650:/mnt/qtang/llm-test/qwen3-1.7b# python3 qwen3_tokenizer_uid.py
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Server running at http://0.0.0.0:12345
```
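The runtime processes connect to this service over HTTP on port 12345. Its routes are internal to `qwen3_tokenizer_uid.py`, so the sketch below only checks that the port is reachable before starting the runtime; it assumes nothing about the service's API.
```
# Sketch: confirm the tokenizer service is listening.
# Any HTTP status code (even 404) means the port is reachable.
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:12345/ || echo "not reachable"
```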
#### Inference on an AX650 host, such as the M4N-Dock (η±θ―ζ΄ΎPro) or the AX650N demo board
Open another terminal and run `run_qwen3_1.7b_int8_ctx_ax650.sh`:
```
root@ax650:/mnt/qtang/llm-test/qwen3-1.7b# ./run_qwen3_1.7b_int8_ctx_ax650.sh
[I][ Init][ 110]: LLM init start
[I][ Init][ 34]: connect http://127.0.0.1:12345 ok
[I][ Init][ 57]: uid: 7a057c11-c513-485f-84a1-1d28dcbeb89d
bos_id: -1, eos_id: 151645
3% | ██ | 1 / 31 [3.97s<123.16s, 0.25 count/s] tokenizer init ok
[I][ Init][ 26]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ | 31 / 31 [23.76s<23.76s, 1.30 count/s] init post axmodel ok,remain_cmm(8740 MB)
[I][ Init][ 188]: max_token_len : 2559
[I][ Init][ 193]: kv_cache_size : 1024, kv_cache_num: 2559
[I][ Init][ 201]: prefill_token_num : 128
[I][ Init][ 205]: grp: 1, prefill_max_token_num : 1
[I][ Init][ 205]: grp: 2, prefill_max_token_num : 512
[I][ Init][ 205]: grp: 3, prefill_max_token_num : 1024
[I][ Init][ 205]: grp: 4, prefill_max_token_num : 1536
[I][ Init][ 205]: grp: 5, prefill_max_token_num : 2048
[I][ Init][ 209]: prefill_max_token_num : 2048
[I][ load_config][ 282]: load config:
{
"enable_repetition_penalty": false,
"enable_temperature": false,
"enable_top_k_sampling": true,
"enable_top_p_sampling": false,
"penalty_window": 20,
"repetition_penalty": 1.2,
"temperature": 0.9,
"top_k": 1,
"top_p": 0.8
}
[I][ Init][ 218]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
[I][ GenerateKVCachePrefill][ 270]: input token num : 21, prefill_split_num : 1 prefill_grpid : 2
[I][ GenerateKVCachePrefill][ 307]: input_num_token:21
[I][ main][ 230]: precompute_len: 21
[I][ main][ 231]: system_prompt: You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
prompt >> 1+1=?
[I][ SetKVCache][ 530]: prefill_grpid:2 kv_cache_num:512 precompute_len:21 input_num_token:16
[I][ SetKVCache][ 533]: current prefill_max_token_num:1920
[I][ Run][ 659]: input token num : 16, prefill_split_num : 1
[I][ Run][ 685]: input_num_token:16
[I][ Run][ 808]: ttft: 678.72 ms
<think>
</think>
1 + 1 = 2.
[N][ Run][ 922]: hit eos,avg 9.16 token/s
[I][ GetKVCache][ 499]: precompute_len:49, remaining:1999
prompt >> who are you?
[I][ SetKVCache][ 530]: prefill_grpid:2 kv_cache_num:512 precompute_len:49 input_num_token:16
[I][ SetKVCache][ 533]: current prefill_max_token_num:1920
[I][ Run][ 659]: input token num : 16, prefill_split_num : 1
[I][ Run][ 685]: input_num_token:16
[I][ Run][ 808]: ttft: 677.87 ms
<think>
</think>
I am Qwen, a large language model developed by Alibaba Cloud. I can answer questions,
help with tasks, and provide information on various topics. I am designed to be helpful and useful to users.
[N][ Run][ 922]: hit eos,avg 9.13 token/s
[I][ GetKVCache][ 499]: precompute_len:110, remaining:1938
prompt >> q
```
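The sampling block in the `load_config` log above is read from `post_config.json`. With `enable_temperature` and `enable_top_p_sampling` off and `top_k` fixed at 1, decoding is effectively greedy. To get sampled outputs you can edit that file; the sketch below infers the key set from the log output, so treat the schema as an assumption and keep a backup of the original file.
```
# Sketch: enable temperature and top-p sampling.
# Keys are inferred from the load_config log; back up the original first.
cp post_config.json post_config.json.bak
cat > post_config.json <<'EOF'
{
  "enable_repetition_penalty": false,
  "enable_temperature": true,
  "enable_top_k_sampling": true,
  "enable_top_p_sampling": true,
  "penalty_window": 20,
  "repetition_penalty": 1.2,
  "temperature": 0.9,
  "top_k": 20,
  "top_p": 0.8
}
EOF
```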
#### Inference with the M.2 Accelerator card
[What is the M.2 Accelerator card?](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html) This demo runs on a Raspberry Pi 5.
```
(base) axera@raspberrypi:~/samples/qwen3-1.7b $ ./run_qwen3_1.7b_int8_ctx_axcl_aarch64.sh
[I][ Init][ 136]: LLM init start
[I][ Init][ 34]: connect http://127.0.0.1:12345 ok
[I][ Init][ 57]: uid: ea509ef6-ab6c-49b0-9dcf-931db2ce1bf7
bos_id: -1, eos_id: 151645
3% | ██ | 1 / 31 [0.98s<30.47s, 1.02 count/s] tokenizer init ok
[I][ Init][ 45]: LLaMaEmbedSelector use mmap
6% | ███ | 2 / 31 [0.98s<15.24s, 2.03 count/s] embed_selector init ok
[I][ run][ 30]: AXCLWorker start with devid 0
100% | ████████████████████████████████ | 31 / 31 [49.40s<49.40s, 0.63 count/s] init post axmodel ok,remain_cmm(3788 MB)
[I][ Init][ 237]: max_token_len : 2559
[I][ Init][ 240]: kv_cache_size : 1024, kv_cache_num: 2559
[I][ Init][ 248]: prefill_token_num : 128
[I][ Init][ 252]: grp: 1, prefill_max_token_num : 1
[I][ Init][ 252]: grp: 2, prefill_max_token_num : 512
[I][ Init][ 252]: grp: 3, prefill_max_token_num : 1024
[I][ Init][ 252]: grp: 4, prefill_max_token_num : 1536
[I][ Init][ 252]: grp: 5, prefill_max_token_num : 2048
[I][ Init][ 256]: prefill_max_token_num : 2048
________________________
| ID| remain cmm(MB)|
========================
| 0| 3788|
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
[I][ load_config][ 282]: load config:
{
"enable_repetition_penalty": false,
"enable_temperature": false,
"enable_top_k_sampling": true,
"enable_top_p_sampling": false,
"penalty_window": 20,
"repetition_penalty": 1.2,
"temperature": 0.9,
"top_k": 1,
"top_p": 0.8
}
[I][ Init][ 279]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
[I][ GenerateKVCachePrefill][ 335]: input token num : 21, prefill_split_num : 1 prefill_grpid : 2
[I][ GenerateKVCachePrefill][ 372]: input_num_token:21
[I][ main][ 236]: precompute_len: 21
[I][ main][ 237]: system_prompt: You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
prompt >> 1+2=?
[I][ SetKVCache][ 628]: prefill_grpid:2 kv_cache_num:512 precompute_len:21 input_num_token:16
[I][ SetKVCache][ 631]: current prefill_max_token_num:1920
[I][ Run][ 869]: input token num : 16, prefill_split_num : 1
[I][ Run][ 901]: input_num_token:16
[I][ Run][1030]: ttft: 796.97 ms
<think>
</think>
1 + 2 = 3.
[N][ Run][1182]: hit eos,avg 7.43 token/s
[I][ GetKVCache][ 597]: precompute_len:49, remaining:1999
prompt >> who are you?
[I][ SetKVCache][ 628]: prefill_grpid:2 kv_cache_num:512 precompute_len:49 input_num_token:16
[I][ SetKVCache][ 631]: current prefill_max_token_num:1920
[I][ Run][ 869]: input token num : 16, prefill_split_num : 1
[I][ Run][ 901]: input_num_token:16
[I][ Run][1030]: ttft: 800.01 ms
<think>
</think>
I am Qwen, a large language model developed by Alibaba Cloud. I can help with various tasks,
such as answering questions, writing text, providing explanations, and more. If you have any questions or need assistance, feel free to ask!
[N][ Run][1182]: hit eos,avg 7.42 token/s
[I][ GetKVCache][ 597]: precompute_len:118, remaining:1930
prompt >> q
[I][ run][ 80]: AXCLWorker exit with devid 0
(base) axera@raspberrypi:~/samples/qwen3-1.7b $
(base) axera@raspberrypi:~ $ axcl-smi
+------------------------------------------------------------------------------------------------+
| AXCL-SMI V3.4.0_20250423020139 Driver V3.4.0_20250423020139 |
+-----------------------------------------+--------------+---------------------------------------+
| Card Name Firmware | Bus-Id | Memory-Usage |
| Fan Temp Pwr:Usage/Cap | CPU NPU | CMM-Usage |
|=========================================+==============+=======================================|
| 0 AX650N V3.4.0 | 0000:01:00.0 | 183 MiB / 945 MiB |
| -- 38C -- / -- | 0% 0% | 3251 MiB / 7040 MiB |
+-----------------------------------------+--------------+---------------------------------------+
+------------------------------------------------------------------------------------------------+
| Processes: |
| Card PID Process Name NPU Memory Usage |
|================================================================================================|
| 0 71266 /home/axera/samples/qwen3-1.7b/main_axcl_aarch64 2193524 KiB |
+------------------------------------------------------------------------------------------------+
(base) axera@raspberrypi:~ $
``` |