ChunkFormer-Large-En-Libri-960h: Pretrained ChunkFormer-Large on 960 hours of LibriSpeech dataset

License: CC BY-NC 4.0


Table of contents

  1. Model Description
  2. Documentation and Implementation
  3. Benchmark Results
  4. Quick Usage
  5. Citation
  6. Contact

Model Description

ChunkFormer-Large-En-Libri-960h is an English Automatic Speech Recognition (ASR) model based on the ChunkFormer architecture, introduced at ICASSP 2025. The model has been fine-tuned on 960 hours of LibriSpeech, a widely used benchmark dataset for ASR research.


Documentation and Implementation

The documentation and implementation of ChunkFormer are publicly available in the official GitHub repository.


Benchmark Results

We evaluate the models using Word Error Rate (WER). To ensure a fair comparison, all models were trained exclusively with the WeNet framework.

|   | STT Model           | Test-Clean | Test-Other | Avg. |
|---|---------------------|------------|------------|------|
| 1 | ChunkFormer         | 2.69       | 6.91       | 4.80 |
| 2 | Efficient Conformer | 2.71       | 6.95       | 4.83 |
| 3 | Conformer           | 2.77       | 6.93       | 4.85 |
| 4 | Squeezeformer       | 2.87       | 7.16       | 5.02 |
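
WER can be reproduced from reference/hypothesis pairs with any standard implementation. A minimal sketch using the jiwer package (an assumption for illustration, not the evaluation script used for the table above):

# Minimal WER computation sketch with jiwer (pip install jiwer);
# illustrative only, not the benchmark script used for the table above.
import jiwer

reference = "this is a transcription example"
hypothesis = "this is transcription example"

# WER = (substitutions + deletions + insertions) / number of reference words
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")   # one deletion out of five words -> 20.00%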

Quick Usage

To use the ChunkFormer model for English Automatic Speech Recognition, follow these steps:

  1. Download the ChunkFormer Repository
git clone https://github.com/khanld/chunkformer.git
cd chunkformer
pip install -r requirements.txt   
  2. Download the Model Checkpoint from Hugging Face
pip install huggingface_hub
huggingface-cli download khanhld/chunkformer-large-en-libri-960h --local-dir "./chunkformer-large-en-libri-960h"

or

git lfs install
git clone https://huggingface.co/khanhld/chunkformer-large-en-libri-960h

Either command downloads the model checkpoint into the chunkformer-large-en-libri-960h folder inside your chunkformer directory.
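
The checkpoint can also be fetched from Python with the huggingface_hub API, which is equivalent to the huggingface-cli command above:

# Programmatic download, equivalent to the huggingface-cli command above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="khanhld/chunkformer-large-en-libri-960h",
    local_dir="./chunkformer-large-en-libri-960h",
)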

  3. Run the model
python decode.py \
    --model_checkpoint path/to/local/chunkformer-large-en-libri-960h \
    --long_form_audio path/to/audio.wav \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128

The --total_batch_duration value is given in seconds; the default is 1800.
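
If your recording is not already a 16 kHz mono WAV file, a minimal preprocessing sketch with torchaudio is shown below (16 kHz mono is assumed here because it is the usual input for LibriSpeech-trained models; verify against the repository's documentation):

# Preprocessing sketch (assumption: 16 kHz mono WAV input, typical for
# LibriSpeech-trained ASR models; confirm in the ChunkFormer repository).
import torchaudio

waveform, sample_rate = torchaudio.load("recording.mp3")   # any format torchaudio supports
waveform = waveform.mean(dim=0, keepdim=True)              # downmix to mono
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
torchaudio.save("audio.wav", waveform, 16000)              # ready for --long_form_audio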

Example Output:

[00:00:01.200] - [00:00:02.400]: this is a transcription example
[00:00:02.500] - [00:00:03.700]: testing the long-form audio
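
Each output line carries a start timestamp, an end timestamp, and the decoded text. A small hypothetical helper (not part of the repository) for turning this output into segments could look like:

# Hypothetical helper (not part of the ChunkFormer repository): parse the
# "[HH:MM:SS.mmm] - [HH:MM:SS.mmm]: text" lines printed by decode.py.
import re

LINE_RE = re.compile(r"\[(\d{2}:\d{2}:\d{2}\.\d{3})\] - \[(\d{2}:\d{2}:\d{2}\.\d{3})\]: (.*)")

def parse_segments(transcript):
    segments = []
    for line in transcript.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            segments.append(match.groups())   # (start, end, text)
    return segments

print(parse_segments("[00:00:01.200] - [00:00:02.400]: this is a transcription example"))
# [('00:00:01.200', '00:00:02.400', 'this is a transcription example')]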

Advanced usage can be found in the ChunkFormer GitHub repository.


Citation

If you use this work in your research, please cite:

@INPROCEEDINGS{10888640,
  author={Le, Khanh and Ho, Tuan Vu and Tran, Dung and Chau, Duc Thanh},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription}, 
  year={2025},
  volume={},
  number={},
  pages={1-5},
  keywords={Scalability;Memory management;Graphics processing units;Signal processing;Performance gain;Hardware;Resource management;Speech processing;Standards;Context modeling;chunkformer;masked batch;long-form transcription},
  doi={10.1109/ICASSP49660.2025.10888640}}

Contact
