---
license: apache-2.0
base_model:
- Qwen/QwQ-32B
base_model_relation: quantized
library_name: transformers
tags:
- Qwen
- fp4
---
## Evaluation

The results in the following table are based on the MMLU benchmark.

To speed up the evaluation, we prevented the model from generating overly long chains of thought, so the scores may differ from those obtained with longer reasoning chains.

In our experiments, **the accuracy of the FP4 quantized version is close to that of the BF16 version, and it can be used for faster inference.**

| Data Format | MMLU Score |
|:---|:---|
| BF16 Official | 84.36 |
| FP4 Quantized | 80.07 |

## Quickstart
We recommend using the [Chitu inference framework](https://github.com/thu-pacman/chitu) to run this model.
The following is a simple command showing how to run QwQ-32B-fp4:
```bash
torchrun --nproc_per_node 1 \
    --master_port=22525 \
    -m chitu \
    serve.port=21002 \
    infer.cache_type=paged \
    infer.pp_size=1 \
    infer.tp_size=1 \
    models=QwQ-32B-fp4 \
    models.ckpt_dir="your model path" \
    models.tokenizer_path="your model path" \
    dtype=float16 \
    infer.do_load=True \
    infer.max_reqs=1 \
    scheduler.prefill_first.num_tasks=100 \
    infer.max_seq_len=4096 \
    request.max_new_tokens=100 \
    infer.use_cuda_graph=True
```
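
Once the server is running, you can send requests to the configured `serve.port`. The exact HTTP interface depends on your Chitu version, so the request below is only a sketch: it assumes the server exposes an OpenAI-compatible `/v1/chat/completions` endpoint on port 21002; consult the Chitu documentation for the actual API.

```bash
# Sketch of a chat request, assuming an OpenAI-compatible endpoint on
# serve.port (21002); check the Chitu docs for your version's actual API.
curl http://localhost:21002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "QwQ-32B-fp4",
          "messages": [{"role": "user", "content": "Briefly introduce yourself."}],
          "max_tokens": 100
        }'
```

Note that the launch command above caps generation at `request.max_new_tokens=100` and serves one request at a time (`infer.max_reqs=1`); adjust these values for your workload.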

## Contact