---
license: mit
library_name: vllm
base_model:
- deepseek-ai/DeepSeek-R1-0528
pipeline_tag: text-generation
tags:
- deepseek
- neuralmagic
- redhat
- llmcompressor
- quantized
- INT4
- GPTQ
---

# DeepSeek-R1-0528-quantized.w4a16

## Model Overview

- **Model Architecture:** DeepseekV3ForCausalLM
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Activation quantization:** None
  - **Weight quantization:** INT4
- **Release Date:** 05/30/2025
- **Version:** 1.0
- **Model Developers:** Red Hat (Neural Magic)

### Model Optimizations

This model was obtained by quantizing the weights of [DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528) to the INT4 data type.
This optimization reduces the number of bits used to represent each weight from 8 to 4, cutting GPU memory requirements by approximately 50%.
Weight quantization also reduces disk space requirements by approximately 50%. A quantization of this kind can be expressed as an [llm-compressor](https://github.com/vllm-project/llm-compressor) recipe, as sketched below.
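
The snippet below is a minimal llm-compressor sketch of a W4A16 GPTQ recipe. The calibration dataset, sample count, and ignore list are illustrative assumptions, not the exact configuration used to produce this checkpoint.

```python
# Hedged sketch: INT4 weight-only (W4A16) GPTQ quantization with llm-compressor.
# Dataset, sample count, and sequence length are assumptions for illustration.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

recipe = GPTQModifier(
    targets="Linear",    # quantize the weights of Linear layers...
    scheme="W4A16",      # ...to INT4, keeping activations in 16-bit
    ignore=["lm_head"],  # leave the output head unquantized
)

oneshot(
    model="deepseek-ai/DeepSeek-R1-0528",
    dataset="open_platypus",      # illustrative calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```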
## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/DeepSeek-R1-0528-quantized.w4a16"
number_gpus = 8

sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=256)

# Build the prompt with the model's chat template
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
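
As a sketch, a server started with `vllm serve RedHatAI/DeepSeek-R1-0528-quantized.w4a16 --tensor-parallel-size 8` can be queried through the standard `openai` client; the host, port, and API key below assume vLLM's defaults.

```python
# Hedged sketch: querying an OpenAI-compatible vLLM server.
# base_url and api_key assume vLLM's defaults (port 8000, no auth).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/DeepSeek-R1-0528-quantized.w4a16",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=256,
)
print(response.choices[0].message.content)
```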
## Evaluation

The model was evaluated on popular reasoning tasks (AIME 2024, MATH-500, GPQA-Diamond) via [LightEval](https://github.com/huggingface/open-r1).
For reasoning evaluations, we estimate pass@1 based on 10 runs with different seeds, using `temperature=0.6`, `top_p=0.95`, and `max_new_tokens=65536`.
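
Concretely, this estimator averages per-run correctness over the 10 seeded runs for each problem, then averages across the benchmark; a minimal sketch (function name and data layout are illustrative):

```python
# Hedged sketch: pass@1 estimated from k seeded runs per problem (k=10 here).
def estimate_pass_at_1(correct_by_problem: list[list[bool]]) -> float:
    # correct_by_problem[i][j]: whether run j solved problem i
    per_problem = [sum(runs) / len(runs) for runs in correct_by_problem]
    return 100.0 * sum(per_problem) / len(per_problem)  # benchmark-level pass@1 (%)
```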
### Accuracy

|                                | Recovery (%) | deepseek-ai/DeepSeek-R1-0528 | RedHatAI/DeepSeek-R1-0528-quantized.w4a16<br>(this model) |
| ------------------------------ | :----------: | :--------------------------: | :-------------------------------------------------------: |
| AIME 2024<br>pass@1            |    98.50     |            88.66             |                           87.33                            |
| MATH-500<br>pass@1             |    99.88     |            97.52             |                           97.40                            |
| GPQA Diamond<br>pass@1         |    101.21    |            79.65             |                           80.61                            |
| **Reasoning<br>Average Score** |  **99.82**   |          **88.61**           |                         **88.45**                          |