---
license: mit
library_name: vllm
base_model:
- deepseek-ai/DeepSeek-R1-0528
pipeline_tag: text-generation
tags:
- deepseek
- neuralmagic
- redhat
- llmcompressor
- quantized
- INT4
- GPTQ
---

# DeepSeek-R1-0528-quantized.w4a16

## Model Overview

- **Model Architecture:** DeepseekV3ForCausalLM
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Activation quantization:** None
  - **Weight quantization:** INT4
- **Release Date:** 05/30/2025
- **Version:** 1.0
- **Model Developers:** Red Hat (Neural Magic)

### Model Optimizations

This model was obtained by quantizing the weights of [DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528) to the INT4 data type.
This optimization reduces the number of bits used to represent each weight from 8 to 4, cutting GPU memory requirements by approximately 50%.
Weight quantization also reduces disk space requirements by approximately 50%. A quantization of this kind can be expressed as an [llm-compressor](https://github.com/vllm-project/llm-compressor) recipe, as sketched below.
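
The snippet below is a minimal llm-compressor sketch of a W4A16 GPTQ recipe. The calibration dataset, sample count, and ignore list are illustrative assumptions, not the exact configuration used to produce this checkpoint.

```python
# Hedged sketch: INT4 weight-only (W4A16) GPTQ quantization with llm-compressor.
# Dataset, sample count, and sequence length are assumptions for illustration.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

recipe = GPTQModifier(
    targets="Linear",    # quantize the weights of Linear layers...
    scheme="W4A16",      # ...to INT4, keeping activations in 16-bit
    ignore=["lm_head"],  # leave the output head unquantized
)

oneshot(
    model="deepseek-ai/DeepSeek-R1-0528",
    dataset="open_platypus",      # illustrative calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```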
## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/DeepSeek-R1-0528-quantized.w4a16"
number_gpus = 8

sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=256)

# Build the prompt with the model's chat template
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
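
As a sketch, a server started with `vllm serve RedHatAI/DeepSeek-R1-0528-quantized.w4a16 --tensor-parallel-size 8` can be queried through the standard `openai` client; the host, port, and API key below assume vLLM's defaults.

```python
# Hedged sketch: querying an OpenAI-compatible vLLM server.
# base_url and api_key assume vLLM's defaults (port 8000, no auth).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/DeepSeek-R1-0528-quantized.w4a16",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=256,
)
print(response.choices[0].message.content)
```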
## Evaluation

The model was evaluated on popular reasoning tasks (AIME 2024, MATH-500, GPQA-Diamond) via [LightEval](https://github.com/huggingface/open-r1).
For reasoning evaluations, we estimate pass@1 based on 10 runs with different seeds, using `temperature=0.6`, `top_p=0.95`, and `max_new_tokens=65536`.
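
Concretely, this estimator averages per-run correctness over the 10 seeded runs for each problem, then averages across the benchmark; a minimal sketch (function name and data layout are illustrative):

```python
# Hedged sketch: pass@1 estimated from k seeded runs per problem (k=10 here).
def estimate_pass_at_1(correct_by_problem: list[list[bool]]) -> float:
    # correct_by_problem[i][j]: whether run j solved problem i
    per_problem = [sum(runs) / len(runs) for runs in correct_by_problem]
    return 100.0 * sum(per_problem) / len(per_problem)  # benchmark-level pass@1 (%)
```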
### Accuracy

|                                | Recovery (%) | deepseek-ai/DeepSeek-R1-0528 | RedHatAI/DeepSeek-R1-0528-quantized.w4a16<br>(this model) |
| ------------------------------ | :----------: | :--------------------------: | :-------------------------------------------------------: |
| AIME 2024<br>pass@1            |    98.50     |            88.66             |                           87.33                            |
| MATH-500<br>pass@1             |    99.88     |            97.52             |                           97.40                            |
| GPQA Diamond<br>pass@1         |    101.21    |            79.65             |                           80.61                            |
| **Reasoning<br>Average Score** |  **99.82**   |          **88.61**           |                         **88.45**                          |