---
license: apache-2.0
library_name: transformers
language:
- en
base_model:
- Qwen/QwQ-32B
tags:
- int8
- qwen2
- qwq
- vllm
base_model_relation: quantized
pipeline_tag: text-generation
---

# QwQ-32B-INT8-W8A8

![image/jpeg](https://i.imgur.com/WUQH51b.jpeg)

## Model Overview
- **Model Architecture:** Transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** INT8
  - **Activation quantization:** INT8
- **Release Date:** 3/13/2025

INT8 quantized version of [QwQ-32B](https://huggingface.co/Qwen/QwQ-32B).


### Model Optimizations

This model was obtained by quantizing the weights and activations of [QwQ-32B](https://huggingface.co/Qwen/QwQ-32B) to the INT8 data type.
This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
Weight quantization also reduces disk size requirements by approximately 50%.

Only the weights and activations of the linear operators within transformer blocks are quantized.
Weights are quantized using a symmetric per-channel scheme, while activations are quantized using a symmetric per-token scheme.
The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
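
To make the scheme concrete, here is a minimal sketch of symmetric per-channel weight quantization and symmetric per-token activation quantization in plain PyTorch. It is an illustrative toy (the function and tensor names are made up), not the llm-compressor implementation, which additionally applies GPTQ error compensation to the weights.

```python
# Toy sketch of the W8A8 scheme described above -- NOT the llm-compressor code.
import torch

def quantize_weight_per_channel(w: torch.Tensor):
    # One symmetric scale per output channel (row of the weight matrix).
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def quantize_activation_per_token(x: torch.Tensor):
    # One symmetric scale per token (row of the activation matrix).
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

w = torch.randn(4096, 4096)   # weight of one linear operator (out_features x in_features)
x = torch.randn(16, 4096)     # activations for 16 tokens
qw, w_scale = quantize_weight_per_channel(w)
qx, x_scale = quantize_activation_per_token(x)

# Dequantize and matmul in float for simplicity; real INT8 kernels do the matmul
# in INT8 with INT32 accumulation and rescale afterwards.
y = (qx.float() * x_scale) @ (qw.float() * w_scale).t()
ref = x @ w.t()
print((y - ref).abs().mean() / ref.abs().mean())  # small relative error (a few percent)
```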


## Use with vLLM

I deploy this model with the OpenAI-compatible [vLLM](https://docs.vllm.ai/en/latest/) Docker image, as shown in the example below.

```bash
#!/bin/bash

# Default values
NAME_SUFFIX=""
PORT=8010
GPUS="0,1"  # Default GPUs

# Parse command line arguments
while getopts "s:p:g:" opt; do
    case $opt in
        s) NAME_SUFFIX="$OPTARG";;    # suffix for container name
        p) PORT="$OPTARG";;          # port number
        g) GPUS="$OPTARG";;          # GPU devices (e.g., "2,3")
        ?) echo "Usage: $0 [-s suffix] [-p port] [-g gpus]"
           exit 1;;
    esac
done

model=ospatch/QwQ-32B-INT8-W8A8
volume=~/.cache/huggingface/hub
revision=main
version=latest
context=16384
base_name="vllm-qwq-int8"
container_name="${base_name}${NAME_SUFFIX}"

sudo docker run --restart=unless-stopped --name $container_name --runtime nvidia --gpus '"device='"$GPUS"'"' \
     --shm-size 1g -p $PORT:8000 -e NCCL_P2P_DISABLE=1 -e HUGGING_FACE_HUB_TOKEN=<user_token> \
     -v $volume:/root/.cache/huggingface/hub vllm/vllm-openai:$version --model $model \
     --revision $revision --tensor-parallel-size 2 \
     --gpu-memory-utilization 0.97 --max-model-len $context --enable-chunked-prefill
```

No command line arguments are needed for the default configuration.
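
Once the container is running, the model is served through vLLM's OpenAI-compatible API on the mapped port (8010 by default above). Below is a minimal sketch of a chat request, assuming the `openai` Python client, the default port, and the model name used above.

```python
# Minimal chat request against the vLLM server started above.
# Assumes the default port (8010) and the openai Python package (pip install openai).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8010/v1", api_key="EMPTY")  # no auth configured above

response = client.chat.completions.create(
    model="ospatch/QwQ-32B-INT8-W8A8",
    messages=[{"role": "user", "content": "How many r's are in the word strawberry?"}],
    temperature=0.6,   # sampling settings suggested in the QwQ-32B model card
    top_p=0.95,
    max_tokens=4096,
)
print(response.choices[0].message.content)
```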

## Creation

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below. Credit to [Neural Magic](https://huggingface.co/neuralmagic) for the recipe.

```python
## script copied from Neural Magic

from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

# Load model
model_stub = "Qwen/QwQ-32B"
model_name = model_stub.split("/")[-1]

num_samples = 1024
max_seq_len = 8192

tokenizer = AutoTokenizer.from_pretrained(model_stub)

device_map = calculate_offload_device_map(
    model_stub,
    reserve_for_hessians=True,
    num_gpus=4,
    torch_dtype="auto",
)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map=device_map,
    torch_dtype="auto",
)

def preprocess_fn(example):
    return {"text": tokenizer.apply_chat_template(
        example["messages"], add_generation_prompt=False, tokenize=False
    )}

ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.map(preprocess_fn)

# Configure the quantization algorithm and scheme
recipe = [
    SmoothQuantModifier(smoothing_strength=0.7),
    GPTQModifier(
        targets="Linear",
        scheme="W8A8",
        ignore=["lm_head"],
        dampening_frac=0.1,
    ),
]

# Apply quantization
oneshot(
    model=model,
    dataset=ds, 
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-INT8-W8A8"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```

## Usage Guidelines

Please refer to the usage guidelines in the original [QwQ-32B](https://huggingface.co/Qwen/QwQ-32B) model card.

## Evaluation & Accuracy

The model passes an informal vibe check, but no systematic evaluation of the accuracy loss from quantization has been performed.
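
If you want numbers, one common approach is [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) with its vLLM backend. The sketch below is untested for this model; the task choice and model arguments are only an example.

```python
# Untested sketch: scoring the quantized model with lm-evaluation-harness
# (pip install lm_eval vllm), then comparing against Qwen/QwQ-32B run the same way.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=ospatch/QwQ-32B-INT8-W8A8,"
        "tensor_parallel_size=2,gpu_memory_utilization=0.9,max_model_len=8192"
    ),
    tasks=["gsm8k"],   # example task only
    num_fewshot=5,
    batch_size="auto",
)
print(results["results"])
```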