Create README.md

README.md (ADDED)

---
license: apache-2.0
base_model:
- Qwen/QwQ-32B
base_model_relation: quantized
library_name: transformers
tags:
- Qwen
- fp4
---
## Evaluation

The test results in the following table are based on the MMLU benchmark.

To speed up the test, we prevent the model from generating overly long chains of thought, so the scores may differ from those obtained with longer chains of thought.

In our experiments, **the accuracy of the FP4 quantized version is almost the same as that of the BF16 version, and it can be used for faster inference.**

| Data Format | MMLU Score |
|:---|:---|
| BF16 Official | 84.36 |
| FP4 Quantized | 80.07 |

## Quickstart

We recommend using the [Chitu inference framework](https://github.com/thu-pacman/chitu) to run this model.
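
If you do not already have Chitu set up, a source install might look roughly like the sketch below. The exact build steps (CUDA toolchain, extra build flags) are version-specific assumptions here, so follow the instructions in the Chitu repository for the authoritative procedure.

```bash
# Rough install sketch (assumption: a standard source checkout plus an
# editable pip install is sufficient; the Chitu repository documents the
# authoritative, CUDA-version-specific steps).
git clone --recursive https://github.com/thu-pacman/chitu.git
cd chitu
pip install -e .
```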

Here is a simple command showing how to run QwQ-32B-fp4:

```bash
torchrun --nproc_per_node 1 \
    --master_port=22525 \
    -m chitu \
    serve.port=21002 \
    infer.cache_type=paged \
    infer.pp_size=1 \
    infer.tp_size=1 \
    models=QwQ-32B-fp4 \
    models.ckpt_dir="your model path" \
    models.tokenizer_path="your model path" \
    dtype=float16 \
    infer.do_load=True \
    infer.max_reqs=1 \
    scheduler.prefill_first.num_tasks=100 \
    infer.max_seq_len=4096 \
    request.max_new_tokens=100 \
    infer.use_cuda_graph=True
```
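
Once the server is running, you can send it a request. The sketch below assumes the service exposes an OpenAI-style chat completions endpoint on the `serve.port` set above (21002); the endpoint path and payload fields are assumptions, so check the Chitu documentation if your version differs.

```bash
# Hypothetical request to the server started above (serve.port=21002).
# The /v1/chat/completions path and the JSON payload assume an
# OpenAI-compatible interface; adjust to what your Chitu version exposes.
curl http://localhost:21002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Give a one-sentence summary of FP4 quantization."}
        ]
      }'
```

Note that the serving command above sets `request.max_new_tokens=100` and `infer.max_seq_len=4096`, which is enough for a quick smoke test but likely too small for QwQ's long chains of thought; raise these values for real workloads.
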
## Contact