|
---
license: apache-2.0
base_model:
- Qwen/QwQ-32B
base_model_relation: quantized
library_name: transformers
tags:
- Qwen
- fp4
---
|
## Evaluation |
|
|
|
The results in the following table are based on the MMLU benchmark.

To speed up evaluation, we limit how long the model's chain of thought can grow, so the scores may differ from those obtained with longer reasoning chains.

In our experiments, **the accuracy of the FP4 quantized version is close to that of the BF16 version, while allowing faster inference.**
|
|
|
| Data Format | MMLU Score |
|:---|:---|
| BF16 Official | 84.36 |
| FP4 Quantized | 80.07 |
|
## Quickstart |
|
We recommend using the [Chitu](https://github.com/thu-pacman/chitu) inference framework to run this model.

The following command shows how to serve QwQ-32B-fp4.
|
```bash
torchrun --nproc_per_node 1 \
    --master_port=22525 \
    -m chitu \
    serve.port=21002 \
    infer.cache_type=paged \
    infer.pp_size=1 \
    infer.tp_size=1 \
    models=QwQ-32B-fp4 \
    models.ckpt_dir="your model path" \
    models.tokenizer_path="your model path" \
    dtype=float16 \
    infer.do_load=True \
    infer.max_reqs=1 \
    scheduler.prefill_first.num_tasks=100 \
    infer.max_seq_len=4096 \
    request.max_new_tokens=100 \
    infer.use_cuda_graph=True
```
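Once the server is running, you can send requests to the port configured by `serve.port`. The exact request format depends on your Chitu version; the sketch below assumes an OpenAI-style chat completions endpoint and a hypothetical model name, so adjust the route and payload to match the Chitu documentation if your version differs.

```bash
# Minimal test request, assuming an OpenAI-compatible endpoint on serve.port=21002.
# The route, the "model" field, and the payload shape are assumptions; check the
# Chitu documentation for what your version actually exposes.
curl http://localhost:21002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "QwQ-32B-fp4",
          "messages": [
            {"role": "user", "content": "How many r letters are in the word strawberry?"}
          ],
          "max_tokens": 100
        }'
```

Note that `request.max_new_tokens=100` in the serving command caps generation length per request, so increase it if you want longer reasoning chains than in this quick test.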
|
## Contact |
|
|
|
[email protected] |