Create README.md

README.md (ADDED)

---
license: apache-2.0
base_model:
- Qwen/QwQ-32B
base_model_relation: quantized
library_name: transformers
tags:
- Qwen
- fp4
---
## Evaluation

The test results in the following table are based on the MMLU benchmark.

To speed up the test, we prevent the model from generating overly long chains of thought, so the scores may differ from those obtained with longer chains of thought.

In our experiments, **the accuracy of the FP4 quantized version is almost the same as that of the BF16 version, and it can be used for faster inference.**

| Data Format | MMLU Score |
|:---|:---|
| BF16 Official | 84.36 |
| FP4 Quantized | 80.07 |

## Quickstart

We recommend using the [Chitu inference framework](https://github.com/thu-pacman/chitu) to run this model.
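
If you do not already have Chitu set up, a source install might look roughly like the sketch below. The exact build steps (CUDA toolchain, extra build flags) are version-specific assumptions here, so follow the instructions in the Chitu repository for the authoritative procedure.

```bash
# Rough install sketch (assumption: a standard source checkout plus an
# editable pip install is sufficient; the Chitu repository documents the
# authoritative, CUDA-version-specific steps).
git clone --recursive https://github.com/thu-pacman/chitu.git
cd chitu
pip install -e .
```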

Here is a simple command showing how to run QwQ-32B-fp4:

```bash
torchrun --nproc_per_node 1 \
    --master_port=22525 \
    -m chitu \
    serve.port=21002 \
    infer.cache_type=paged \
    infer.pp_size=1 \
    infer.tp_size=1 \
    models=QwQ-32B-fp4 \
    models.ckpt_dir="your model path" \
    models.tokenizer_path="your model path" \
    dtype=float16 \
    infer.do_load=True \
    infer.max_reqs=1 \
    scheduler.prefill_first.num_tasks=100 \
    infer.max_seq_len=4096 \
    request.max_new_tokens=100 \
    infer.use_cuda_graph=True
```
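
Once the server is running, you can send it a request. The sketch below assumes the service exposes an OpenAI-style chat completions endpoint on the `serve.port` set above (21002); the endpoint path and payload fields are assumptions, so check the Chitu documentation if your version differs.

```bash
# Hypothetical request to the server started above (serve.port=21002).
# The /v1/chat/completions path and the JSON payload assume an
# OpenAI-compatible interface; adjust to what your Chitu version exposes.
curl http://localhost:21002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Give a one-sentence summary of FP4 quantization."}
        ]
      }'
```

Note that the serving command above sets `request.max_new_tokens=100` and `infer.max_seq_len=4096`, which is enough for a quick smoke test but likely too small for QwQ's long chains of thought; raise these values for real workloads.
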
## Contact