---
license: apache-2.0
library_name: transformers
language:
- en
base_model:
- Qwen/QwQ-32B
tags:
- int8
- qwen2
- qwq
- vllm
base_model_relation: quantized
pipeline_tag: text-generation
---

# QWQ-32B-INT8-W8A8

![image/jpeg](https://i.imgur.com/WUQH51b.jpeg)

## Model Overview
- **Model Architecture:** Transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** INT8
  - **Activation quantization:** INT8
- **Release Date:** 3/13/2025

INT8 quantized version of [QWQ-32B](https://huggingface.co/Qwen/QwQ-32B).

### Model Optimizations

This model was obtained by quantizing the weights and activations of [QWQ-32B](https://huggingface.co/Qwen/QwQ-32B) to the INT8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements by approximately 50% and increasing matrix-multiply compute throughput by approximately 2x. Weight quantization also reduces disk size requirements by approximately 50%.

Only the weights and activations of the linear operators within the transformer blocks are quantized. Weights are quantized using a symmetric per-channel scheme, whereas activations are quantized using a symmetric per-token scheme. The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.

## Use with vLLM

I deploy using the OpenAI-compatible [vLLM](https://docs.vllm.ai/en/latest/) Docker image, as shown in the example below.

```bash
#!/bin/bash

# Default values
NAME_SUFFIX=""
PORT=8010
GPUS="0,1"  # Default GPUs

# Parse command line arguments
while getopts "s:p:g:" opt; do
    case $opt in
        s) NAME_SUFFIX="$OPTARG";;  # suffix for container name
        p) PORT="$OPTARG";;         # port number
        g) GPUS="$OPTARG";;         # GPU devices (e.g., "2,3")
        ?) echo "Usage: $0 [-s suffix] [-p port] [-g gpus]"
           exit 1;;
    esac
done

model=ospatch/QwQ-32B-INT8-W8A8
volume=~/.cache/huggingface/hub
revision=main
version=latest
context=16384

base_name="vllm-qwq-int8"
container_name="${base_name}${NAME_SUFFIX}"

sudo docker run --restart=unless-stopped --name $container_name --runtime nvidia --gpus '"device='"$GPUS"'"' \
    --shm-size 1g -p $PORT:8000 -e NCCL_P2P_DISABLE=1 -e HUGGING_FACE_HUB_TOKEN= \
    -v $volume:/root/.cache/huggingface/hub vllm/vllm-openai:$version --model $model \
    --revision $revision --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.97 --max-model-len $context --enable-chunked-prefill
```

No command line arguments are needed for the default configuration.
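Once the container is up, the server speaks the standard OpenAI API. The snippet below is a minimal sketch of a client call, assuming the default port (8010) and model name from the script above; the sampling values mirror the upstream QwQ-32B usage guidelines (temperature 0.6, top-p 0.95), and `max_tokens` is just an illustrative choice.

```python
# Minimal sketch: query the vLLM OpenAI-compatible endpoint started above.
# Assumes the default port 8010 from the deployment script; adjust base_url
# if you passed -p. Sampling values follow the upstream QwQ-32B guidelines.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8010/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="ospatch/QwQ-32B-INT8-W8A8",
    messages=[{"role": "user", "content": "How many r's are in the word 'strawberry'?"}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=4096,  # illustrative; the server is launched with a 16384-token context
)
print(response.choices[0].message.content)
```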
## Creation

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below. Credit to [Neural Magic](https://huggingface.co/neuralmagic) for the recipe.

```python
## script copied from Neural Magic
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

# Load model
model_stub = "Qwen/QwQ-32B"
model_name = model_stub.split("/")[-1]

num_samples = 1024
max_seq_len = 8192

tokenizer = AutoTokenizer.from_pretrained(model_stub)

device_map = calculate_offload_device_map(
    model_stub,
    reserve_for_hessians=True,
    num_gpus=4,
    torch_dtype="auto",
)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map=device_map,
    torch_dtype="auto",
)

# Render each calibration example with the model's chat template
def preprocess_fn(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}

ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.map(preprocess_fn)

# Configure the quantization algorithm and scheme:
# SmoothQuant shifts activation outliers into the weights, then GPTQ quantizes
# weights and activations to INT8 (W8A8), leaving lm_head in higher precision.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.7),
    GPTQModifier(
        targets="Linear",
        scheme="W8A8",
        ignore=["lm_head"],
        dampening_frac=0.1,
    ),
]

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-INT8-W8A8"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```

## Usage Guidelines

Please reference the model card for [QWQ-32B](https://huggingface.co/Qwen/QwQ-32B).

## Evaluation & Accuracy

The model passes the vibe check, but no attempt was made to evaluate the quantized model for accuracy loss.
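If you want to quantify accuracy loss yourself, one option is [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) with its vLLM backend. The snippet below is an unverified sketch, not something that was run for this card; the task choice (GSM8K) and the parallelism settings are assumptions that mirror the deployment script, and you would run the same tasks against the original QWQ-32B for comparison.

```python
# Unverified sketch: score the quantized model with lm-evaluation-harness (vLLM backend).
# Nothing here was run for this card; the task and parallelism settings are assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=ospatch/QwQ-32B-INT8-W8A8,"
        "tensor_parallel_size=2,gpu_memory_utilization=0.9,max_model_len=4096"
    ),
    tasks=["gsm8k"],
    num_fewshot=5,
)
print(results["results"]["gsm8k"])
```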