Evaluation

The results in the following table are based on the MMLU benchmark.

To speed up evaluation, we cap the length of the model's chain of thought, so the scores may differ from those obtained with longer reasoning chains.

In our experiments, the accuracy of the FP4 quantized version is almost identical to that of the BF16 version, while allowing faster inference.

| Data Format      | MMLU Score |
|------------------|------------|
| BF16 (official)  | 79.92      |
| FP4 (quantized)  | 79.50      |
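
As a rough illustration of the evaluation setup, the sketch below scores one MMLU-style multiple-choice question with the official BF16 model while capping the generated chain of thought via max_new_tokens. It uses the Hugging Face transformers API; the question text and token budget are illustrative assumptions, not the exact harness used to produce the table above.

# Hedged sketch: one MMLU-style question with a capped chain of thought.
# The prompt and max_new_tokens value are illustrative assumptions; they are
# not the exact evaluation harness used for the scores above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"  # BF16 official weights
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

question = (
    "Which of the following is a noble gas?\n"
    "A. Nitrogen\nB. Oxygen\nC. Argon\nD. Hydrogen\n"
    "Answer with a single letter."
)
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Capping max_new_tokens limits how long the thought chain can grow,
# mirroring the speed-up described above.
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))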

Quickstart

We recommend using the Chitu inference framework (https://github.com/thu-pacman/chitu) to run this model. The following command shows how to launch Qwen3-8B-fp4.

torchrun --nproc_per_node 1 \
    --master_port=22525 \
    -m chitu \
    serve.port=21002 \
    infer.cache_type=paged \
    infer.pp_size=1 \
    infer.tp_size=1 \
    models=Qwen3-8B-fp4 \
    models.ckpt_dir="your model path" \
    models.tokenizer_path="your model path" \
    dtype=float16 \
    infer.do_load=True \
    infer.max_reqs=1 \
    scheduler.prefill_first.num_tasks=100 \
    infer.max_seq_len=4096 \
    request.max_new_tokens=100 \
    infer.use_cuda_graph=True
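
Once the server is running, you can send requests to the port given by serve.port (21002 above). The sketch below is a minimal Python client written under the assumption that Chitu exposes an OpenAI-compatible /v1/chat/completions endpoint; check the Chitu documentation for the exact route and payload schema.

# Minimal client sketch. Assumption: Chitu serves an OpenAI-compatible chat
# completions endpoint at /v1/chat/completions on serve.port; consult the
# Chitu docs for the actual API before relying on this.
import requests

payload = {
    "model": "Qwen3-8B-fp4",
    "messages": [{"role": "user", "content": "Briefly explain FP4 quantization."}],
    "max_tokens": 100,  # matches request.max_new_tokens in the launch command
}
resp = requests.post("http://localhost:21002/v1/chat/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json())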

Contact

[email protected]
