---
license: apache-2.0
library_name: transformers
language:
- en
base_model:
- Qwen/QwQ-32B
tags:
- int8
- qwen2
- qwq
- vllm
base_model_relation: quantized
pipeline_tag: text-generation
---
# QwQ-32B-INT8-W8A8

## Model Overview
- **Model Architecture:** Transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** INT8
  - **Activation quantization:** INT8
- **Release Date:** 3/13/2025

INT8 quantized version of [QwQ-32B](https://huggingface.co/Qwen/QwQ-32B).

### Model Optimizations
This model was obtained by quantizing the weights and activations of [QwQ-32B](https://huggingface.co/Qwen/QwQ-32B) to the INT8 data type.
This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
Weight quantization also reduces disk size requirements by approximately 50%.
Only the weights and activations of the linear operators within transformer blocks are quantized.
Weights are quantized using a symmetric per-channel scheme, whereas activations are quantized using a symmetric per-token scheme.
The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
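
Since each INT8 value occupies one byte instead of the two bytes used by a 16-bit value, weights and activations shrink by roughly half. As an informal illustration of the two quantization schemes (a minimal PyTorch sketch, not the llm-compressor implementation), the snippet below computes symmetric INT8 scales per output channel for a weight matrix and per token for an activation matrix:

```python
import torch

def quantize_symmetric_int8(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Symmetric INT8 quantization with one scale per row of `x`."""
    # For weights [out_features, in_features] this yields a per-output-channel scale;
    # for activations [tokens, hidden_size] it yields a per-token scale.
    max_abs = x.abs().amax(dim=-1, keepdim=True)
    scale = max_abs.clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

weight = torch.randn(4096, 4096)      # [out_features, in_features]
activations = torch.randn(16, 4096)   # [tokens, hidden_size]

w_q, w_scale = quantize_symmetric_int8(weight)       # per-channel weights
a_q, a_scale = quantize_symmetric_int8(activations)  # per-token activations

# Dequantized approximations; an INT8 GEMM applies these scales in its epilogue.
w_hat = w_q.float() * w_scale
a_hat = a_q.float() * a_scale
```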
## Use with vLLM
I deploy this model using the OpenAI-compatible [vLLM](https://docs.vllm.ai/en/latest/) Docker image, as shown in the example below.
```bash
#!/bin/bash
# Default values
NAME_SUFFIX=""
PORT=8010
GPUS="0,1" # Default GPUs
# Parse command line arguments
while getopts "s:p:g:" opt; do
  case $opt in
    s) NAME_SUFFIX="$OPTARG" ;;  # suffix for container name
    p) PORT="$OPTARG" ;;         # port number
    g) GPUS="$OPTARG" ;;         # GPU devices (e.g., "2,3")
    ?) echo "Usage: $0 [-s suffix] [-p port] [-g gpus]"
       exit 1 ;;
  esac
done

model=ospatch/QwQ-32B-INT8-W8A8
volume=~/.cache/huggingface/hub
revision=main
version=latest
context=16384
base_name="vllm-qwq-int8"
container_name="${base_name}${NAME_SUFFIX}"
sudo docker run --restart=unless-stopped --name $container_name --runtime nvidia --gpus '"device='"$GPUS"'"' \
  --shm-size 1g -p $PORT:8000 -e NCCL_P2P_DISABLE=1 -e HUGGING_FACE_HUB_TOKEN=<user_token> \
  -v $volume:/root/.cache/huggingface/hub vllm/vllm-openai:$version --model $model \
  --revision $revision --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.97 --max-model-len $context --enable-chunked-prefill
```
No command line arguments are needed for the default configuration.
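
Once the container is running, the server exposes the standard OpenAI-compatible API on the chosen port. A minimal client sketch (assuming the default port 8010 and no API key configured on the server) could look like this:

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8010/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="ospatch/QwQ-32B-INT8-W8A8",
    messages=[{"role": "user", "content": "How many r's are in the word strawberry?"}],
    temperature=0.6,
    max_tokens=2048,
)
print(response.choices[0].message.content)
```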
## Creation
This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below. Credit to [Neural Magic](https://huggingface.co/neuralmagic) for the recipe.
```python
# Script copied from Neural Magic
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

# Load the model and tokenizer, spreading the weights across 4 GPUs
# while reserving room for the GPTQ Hessians.
model_stub = "Qwen/QwQ-32B"
model_name = model_stub.split("/")[-1]

num_samples = 1024
max_seq_len = 8192

tokenizer = AutoTokenizer.from_pretrained(model_stub)

device_map = calculate_offload_device_map(
    model_stub,
    reserve_for_hessians=True,
    num_gpus=4,
    torch_dtype="auto",
)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map=device_map,
    torch_dtype="auto",
)

# Load and preprocess the calibration dataset
def preprocess_fn(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"], add_generation_prompt=False, tokenize=False
        )
    }

ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.map(preprocess_fn)

# Configure the quantization algorithm and scheme:
# SmoothQuant shifts activation outliers into the weights, then GPTQ
# quantizes weights and activations to INT8 (W8A8), skipping lm_head.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.7),
    GPTQModifier(
        targets="Linear",
        scheme="W8A8",
        ignore=["lm_head"],
        dampening_frac=0.1,
    ),
]

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-INT8-W8A8"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```
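
As a quick sanity check (a sketch, not part of the original recipe), the saved compressed-tensors checkpoint can be loaded back with vLLM's Python API and prompted for a short generation:

```python
from vllm import LLM, SamplingParams

# Load the freshly saved INT8 checkpoint; vLLM picks up the compressed-tensors config.
llm = LLM(model="QwQ-32B-INT8-W8A8", tensor_parallel_size=2, max_model_len=4096)

params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(["Briefly explain what W8A8 INT8 quantization means."], params)
print(outputs[0].outputs[0].text)
```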
## Usage Guidelines
Please reference the model card for [QwQ-32B](https://huggingface.co/Qwen/QwQ-32B).
## Evaluation & Accuracy
The model passes an informal vibe check, but no systematic evaluation of accuracy loss from quantization has been performed.
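
If you want to quantify the accuracy impact yourself, one possible starting point is [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) with its vLLM backend. The sketch below is illustrative only; the task selection and model arguments are assumptions, and no results are reported here:

```python
import lm_eval

# Evaluate the INT8 checkpoint on GSM8K using the vLLM backend of lm-evaluation-harness.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=ospatch/QwQ-32B-INT8-W8A8,"
        "tensor_parallel_size=2,max_model_len=4096,gpu_memory_utilization=0.9"
    ),
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size="auto",
)
print(results["results"]["gsm8k"])
```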