Quantization for OpenAI's Whisper Models: A Comparative Analysis
Abstract
Automated speech recognition (ASR) models have gained prominence for applications such as captioning, speech translation, and live transcription. This paper studies Whisper and two model variants: one optimized for live speech streaming and another for offline transcription. Notably, these models have been found to generate hallucinated content, reducing transcription reliability. Furthermore, larger model variants exhibit increased latency and pose challenges for deployment on resource-constrained devices. This study first analyzes the similarities and differences among the three Whisper models, qualitatively examining their distinct capabilities. It then quantifies the impact of model quantization on latency and evaluates its viability for edge deployment. Using the open-source LibriSpeech dataset, we evaluate word error rate (WER) and latency for whisper.cpp under three quantization methods (INT4, INT5, INT8). Results show that quantization reduces latency by 19% and model size by 45% while preserving transcription accuracy. These findings provide insight into the optimal use cases for the different Whisper models and into the feasibility of edge deployment. All code, datasets, and implementation details are available in a public GitHub repository: https://github.com/allisonandreyev/WhisperQuantization.git
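The WER metric used in the evaluation above is the standard word-level edit distance normalized by reference length. As a minimal illustrative sketch (not the paper's evaluation code, which lives in the linked repository):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for edit distance over word sequences
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # cost of deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # cost of inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of a six-word reference -> WER of 1/6
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Comparing WER on the same test set before and after quantization is what lets the paper claim that accuracy is preserved while latency and model size drop.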
Community
With the rise of LLM-powered ASR translation devices, how will the ability to translate and detect audio change once these models can be deployed on edge devices?
The following similar papers were recommended by the Semantic Scholar API:
- Understanding Zero-shot Rare Word Recognition Improvements Through LLM Integration (2025)
- LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation (2025)
- On the Robust Approximation of ASR Metrics (2025)
- Investigation of Whisper ASR Hallucinations Induced by Non-Speech Audio (2025)
- Lost in Transcription, Found in Distribution Shift: Demystifying Hallucination in Speech Foundation Models (2025)
- Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance (2025)
- ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription (2025)