---
license: apache-2.0
library_name: transformers
language:
- en
base_model:
- Qwen/QwQ-32B
tags:
- int8
- qwen2
- qwq
- vllm
base_model_relation: quantized
pipeline_tag: text-generation
---

# QwQ-32B-INT8-W8A8

![image/jpeg](https://i.imgur.com/WUQH51b.jpeg)

## Model Overview
- **Model Architecture:** Transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** INT8
  - **Activation quantization:** INT8
- **Release Date:** 3/13/2025

INT8 quantized version of [QwQ-32B](https://huggingface.co/Qwen/QwQ-32B).


### Model Optimizations

This model was obtained by quantizing the weights and activations of [QwQ-32B](https://huggingface.co/Qwen/QwQ-32B) to the INT8 data type.
This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
Weight quantization also reduces disk size requirements by approximately 50%.

Only the weights and activations of the linear operators within transformer blocks are quantized.
Weights are quantized using a symmetric per-channel scheme, while activations are quantized using a symmetric per-token scheme.
The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
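
To make the scheme concrete, here is a minimal sketch of symmetric per-channel weight quantization and symmetric per-token activation quantization in plain PyTorch. It is an illustrative toy (the function and tensor names are made up), not the llm-compressor implementation, which additionally applies GPTQ error compensation to the weights.

```python
# Toy sketch of the W8A8 scheme described above -- NOT the llm-compressor code.
import torch

def quantize_weight_per_channel(w: torch.Tensor):
    # One symmetric scale per output channel (row of the weight matrix).
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def quantize_activation_per_token(x: torch.Tensor):
    # One symmetric scale per token (row of the activation matrix).
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

w = torch.randn(4096, 4096)   # weight of one linear operator (out_features x in_features)
x = torch.randn(16, 4096)     # activations for 16 tokens
qw, w_scale = quantize_weight_per_channel(w)
qx, x_scale = quantize_activation_per_token(x)

# Dequantize and matmul in float for simplicity; real INT8 kernels do the matmul
# in INT8 with INT32 accumulation and rescale afterwards.
y = (qx.float() * x_scale) @ (qw.float() * w_scale).t()
ref = x @ w.t()
print((y - ref).abs().mean() / ref.abs().mean())  # small relative error (a few percent)
```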


## Use with vLLM

I deploy this model with the OpenAI-compatible [vLLM](https://docs.vllm.ai/en/latest/) Docker image, as shown in the example below.

```bash
#!/bin/bash

# Default values
NAME_SUFFIX=""
PORT=8010
GPUS="0,1"  # Default GPUs

# Parse command line arguments
while getopts "s:p:g:" opt; do
    case $opt in
        s) NAME_SUFFIX="$OPTARG";;    # suffix for container name
        p) PORT="$OPTARG";;          # port number
        g) GPUS="$OPTARG";;          # GPU devices (e.g., "2,3")
        ?) echo "Usage: $0 [-s suffix] [-p port] [-g gpus]"
           exit 1;;
    esac
done

model=ospatch/QwQ-32B-INT8-W8A8
volume=~/.cache/huggingface/hub
revision=main
version=latest
context=16384
base_name="vllm-qwq-int8"
container_name="${base_name}${NAME_SUFFIX}"

sudo docker run --restart=unless-stopped --name $container_name --runtime nvidia --gpus '"device='"$GPUS"'"' \
     --shm-size 1g -p $PORT:8000 -e NCCL_P2P_DISABLE=1 -e HUGGING_FACE_HUB_TOKEN=<user_token> \
     -v $volume:/root/.cache/huggingface/hub vllm/vllm-openai:$version --model $model \
     --revision $revision --tensor-parallel-size 2 \
     --gpu-memory-utilization 0.97 --max-model-len $context --enable-chunked-prefill
```

No command line arguments are needed for the default configuration.
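
Once the container is running, the model is served through vLLM's OpenAI-compatible API on the mapped port (8010 by default above). Below is a minimal sketch of a chat request, assuming the `openai` Python client, the default port, and the model name used above.

```python
# Minimal chat request against the vLLM server started above.
# Assumes the default port (8010) and the openai Python package (pip install openai).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8010/v1", api_key="EMPTY")  # no auth configured above

response = client.chat.completions.create(
    model="ospatch/QwQ-32B-INT8-W8A8",
    messages=[{"role": "user", "content": "How many r's are in the word strawberry?"}],
    temperature=0.6,   # sampling settings suggested in the QwQ-32B model card
    top_p=0.95,
    max_tokens=4096,
)
print(response.choices[0].message.content)
```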

## Creation

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below. Credit to [Neural Magic](https://huggingface.co/neuralmagic) for the recipe.

```python
## script copied from Neural Magic

from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

# Load model
model_stub = "Qwen/QwQ-32B"
model_name = model_stub.split("/")[-1]

num_samples = 1024
max_seq_len = 8192

tokenizer = AutoTokenizer.from_pretrained(model_stub)

device_map = calculate_offload_device_map(
    model_stub,
    reserve_for_hessians=True,
    num_gpus=4,
    torch_dtype="auto",
)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map=device_map,
    torch_dtype="auto",
)

def preprocess_fn(example):
    return {"text": tokenizer.apply_chat_template(
        example["messages"], add_generation_prompt=False, tokenize=False
    )}

ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.map(preprocess_fn)

# Configure the quantization algorithm and scheme
recipe = [
    SmoothQuantModifier(smoothing_strength=0.7),
    GPTQModifier(
        targets="Linear",
        scheme="W8A8",
        ignore=["lm_head"],
        dampening_frac=0.1,
    ),
]

# Apply quantization
oneshot(
    model=model,
    dataset=ds, 
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-INT8-W8A8"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```

## Usage Guidelines

Please refer to the usage guidelines in the original [QwQ-32B](https://huggingface.co/Qwen/QwQ-32B) model card.

## Evaluation & Accuracy

The model passes an informal vibe check, but no systematic evaluation of the accuracy loss from quantization has been performed.
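
If you want numbers, one common approach is [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) with its vLLM backend. The sketch below is untested for this model; the task choice and model arguments are only an example.

```python
# Untested sketch: scoring the quantized model with lm-evaluation-harness
# (pip install lm_eval vllm), then comparing against Qwen/QwQ-32B run the same way.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=ospatch/QwQ-32B-INT8-W8A8,"
        "tensor_parallel_size=2,gpu_memory_utilization=0.9,max_model_len=8192"
    ),
    tasks=["gsm8k"],   # example task only
    num_fewshot=5,
    batch_size="auto",
)
print(results["results"])
```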