---
language: en
model_name: Wav2Vec2-BART (Base) English ASR - VoxPopuli Best WER
license: mit
tags:
  - automatic-speech-recognition
  - speech-encoder-decoder
  - wav2vec2
  - bart
  - english
  - voxpopuli
  - generated_from_trainer
  - audio
  - master-thesis
datasets:
  - facebook/voxpopuli
base_model:
  - facebook/wav2vec2-base-en-voxpopuli-v2
  - facebook/bart-base
model-index:
  - name: matejhornik/wav2vec2-base_bart-base_voxpopuli-en
    results:
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: VoxPopuli (English, Test)
          type: facebook/voxpopuli
          config: en
          split: test
        metrics:
          - name: WER
            type: wer
            value: 0.08848048503220916
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: VoxPopuli (English, Validation)
          type: facebook/voxpopuli
          config: en
          split: validation
        metrics:
          - name: WER
            type: wer
            value: 0.08554638942253362
pipeline_tag: automatic-speech-recognition
library_name: transformers
---

# Wav2Vec2-BART (Base) for English ASR on VoxPopuli - Best WER from Master's Thesis

This repository contains the checkpoint for a SpeechEncoderDecoderModel fine-tuned for Automatic Speech Recognition (ASR) on the English portion of the VoxPopuli dataset. This model achieved the best Word Error Rate (WER) of 8.85% on the VoxPopuli English test set within the experimental framework of the Master's thesis "Effective Training of Neural Networks for Automatic Speech Recognition" by Matej Horník.

The model leverages a pre-trained Wav2Vec2 (Base) encoder (facebook/wav2vec2-base-en-voxpopuli-v2) and a pre-trained BART (Base) decoder (facebook/bart-base), connected via convolutional adapter layers.

## Thesis Context

This model is a direct result of work conducted for the Master's thesis:

- Title: Effective Training of Neural Networks for Automatic Speech Recognition
- Author: Matej Horník
- Supervisor: Ing. Alexander Polok
- Institution: Brno University of Technology, Faculty of Information Technology
- Year: 2025
- Thesis Link: https://www.vut.cz/en/students/final-thesis/detail/164401 (the PDF will be available there after the thesis defense)

## Thesis Abstract (English)

This master's thesis focuses on improving the training efficiency and performance of encoder-decoder transformer models for Automatic Speech Recognition (ASR). It investigates the impact of initialization strategies using pre-trained components (Wav2Vec2, BART), the role of convolutional adapters, and Parameter-Efficient Fine-tuning (PEFT) methods like LoRA and DoRA. Experiments on LibriSpeech and VoxPopuli datasets confirmed that full pre-trained initialization is crucial for best Word Error Rate (WER) and convergence. An optimal number of adapters improved performance, while PEFT (especially LoRA) significantly reduced trainable parameters with comparable accuracy. Domain-specific encoder pre-training proved beneficial, and the encoder-decoder model outperformed a CTC baseline in accuracy, offering practical insights for efficient ASR training.

## Model Details

- Encoder: facebook/wav2vec2-base-en-voxpopuli-v2, a Wav2Vec2 (Base) model pre-trained by Facebook on 24.1k hours of unlabeled English VoxPopuli data.
- Decoder: facebook/bart-base, a standard BART (Base) model.
- Architecture: SpeechEncoderDecoderModel from Hugging Face Transformers.
- Adapters: 3 convolutional adapter layers were added to the encoder's output to better align its temporal resolution with the BART decoder's input requirements.
- Feature Extractor: the Wav2Vec2 feature extractor (the initial CNN layers) was kept frozen during fine-tuning, as experiments showed this maintained performance while reducing trainable parameters (see the inspection sketch below).
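These choices are reflected in the checkpoint's configuration. A minimal inspection sketch, assuming the checkpoint exposes the standard Wav2Vec2 config fields:

```python
from transformers import SpeechEncoderDecoderModel

model = SpeechEncoderDecoderModel.from_pretrained(
    "matejhornik/wav2vec2-base_bart-base_voxpopuli-en"
)

# Convolutional adapter layers added on top of the encoder output
print(model.config.encoder.add_adapter)         # expected: True
print(model.config.encoder.num_adapter_layers)  # expected: 3

# Freeze the convolutional feature extractor, as was done during fine-tuning
model.encoder.freeze_feature_encoder()
```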

## Initial Model Construction

The base model (before the fine-tuning that produced this checkpoint) was constructed by combining the pre-trained facebook/wav2vec2-base-en-voxpopuli-v2 encoder and the facebook/bart-base decoder via SpeechEncoderDecoderModel.from_encoder_decoder_pretrained. The construction code is provided in create_model.py:

```bash
python create_model.py
```
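For reference, here is a hypothetical sketch of that construction step using the standard Transformers warm-starting recipe; the authoritative version is create_model.py in this repository, and the exact flags used there may differ:

```python
from transformers import AutoTokenizer, SpeechEncoderDecoderModel

encoder_id = "facebook/wav2vec2-base-en-voxpopuli-v2"
decoder_id = "facebook/bart-base"

# Kwargs prefixed with "encoder_" are forwarded to the encoder config;
# add_adapter inserts the convolutional adapter layers described above.
model = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(
    encoder_id, decoder_id, encoder_add_adapter=True
)

# Seq2seq generation needs the special-token ids wired on the top-level config
tokenizer = AutoTokenizer.from_pretrained(decoder_id)
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.eos_token_id

model.save_pretrained("wav2vec2-base_bart-base")
```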

## Training

### Data

The model was fine-tuned on the train split of the English portion of the VoxPopuli dataset (facebook/voxpopuli, config en). Audio data was resampled to 16kHz. Text transcriptions were tokenized using the BART tokenizer and lowercased.
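A minimal, illustrative preprocessing sketch along these lines (not the exact thesis script; the prepare function is an assumption, and the column names follow the facebook/voxpopuli schema):

```python
from datasets import load_dataset, Audio
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("matejhornik/wav2vec2-base_bart-base_voxpopuli-en")

train = load_dataset("facebook/voxpopuli", "en", split="train")
# Decode audio at 16 kHz, the rate the Wav2Vec2 encoder expects
train = train.cast_column("audio", Audio(sampling_rate=16_000))

def prepare(batch):
    audio = batch["audio"]
    # Raw waveform -> normalized input values for the encoder
    batch["input_values"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_values[0]
    # Lowercased transcription -> BART token ids as labels
    batch["labels"] = processor.tokenizer(batch["normalized_text"].lower()).input_ids
    return batch

train = train.map(prepare, remove_columns=train.column_names)
```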

### Procedure

The model was fine-tuned with a modified run_speech_recognition_seq2seq.py script (provided in the thesis materials, based on Hugging Face's example scripts).

Key Hyperparameters:

- Optimizer: AdamW
- Learning Rate: 1e-4
- LR Scheduler: cosine_with_min_lr (min_lr: 5e-9)
- Warmup Steps: 2000
- Batch Size (per device): 96
- Gradient Accumulation Steps: 1
- Number of Epochs: 20
- Weight Decay: 0.01
- Label Smoothing Factor: 0.05
- Mixed Precision: bf16
- SpecAugment: applied during training
  - mask_time_prob: 0.25, mask_time_length: 30, mask_time_min_masks: 2
  - mask_feature_prob: 0.3, mask_feature_length: 30, mask_feature_min_masks: 1
- Feature Extractor: frozen

The full training command can be found in the thesis materials, including the specific arguments used.
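As a rough illustration, the hyperparameters above map onto Seq2SeqTrainingArguments like this (a sketch under assumptions, not the exact thesis configuration; the SpecAugment values are set on the encoder config rather than in the training arguments):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="wav2vec2-base_bart-base_voxpopuli-en",
    per_device_train_batch_size=96,
    gradient_accumulation_steps=1,
    num_train_epochs=20,
    learning_rate=1e-4,
    lr_scheduler_type="cosine_with_min_lr",
    lr_scheduler_kwargs={"min_lr": 5e-9},
    warmup_steps=2000,
    weight_decay=0.01,
    label_smoothing_factor=0.05,
    bf16=True,
    predict_with_generate=True,
)

# SpecAugment lives on the encoder config, e.g.:
# model.config.encoder.mask_time_prob = 0.25
# model.config.encoder.mask_time_length = 30
# model.config.encoder.mask_time_min_masks = 2
# model.config.encoder.mask_feature_prob = 0.3
# model.config.encoder.mask_feature_length = 30
# model.config.encoder.mask_feature_min_masks = 1
```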

## Evaluation

The model achieves the following Word Error Rate (WER) on the VoxPopuli English dataset:

| Dataset Split | WER (%) | Loss  |
|---------------|---------|-------|
| Validation    | 8.55    | 1.056 |
| Test          | 8.85    | 1.076 |

For detailed training logs, metrics, and visualizations, please refer to the Weights & Biases report.


## How to Use

You can use this model for inference with the Hugging Face transformers library. The snippet below loads audio with soundfile; install it (or an equivalent such as librosa or torchaudio) for audio processing.

```python
from transformers import SpeechEncoderDecoderModel, AutoProcessor
import torch
import soundfile as sf

model_id = "matejhornik/wav2vec2-base_bart-base_voxpopuli-en"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the processor (feature extractor and tokenizer)
processor = AutoProcessor.from_pretrained(model_id)

# Load the model
model = SpeechEncoderDecoderModel.from_pretrained(model_id).to(device)
model.eval()

def transcribe_audio(audio_path):
    """Loads audio, processes it, and transcribes it."""
    speech_array, sampling_rate = sf.read(audio_path)

    # Ensure audio is 16 kHz as expected by the model
    if sampling_rate != processor.feature_extractor.sampling_rate:
        raise ValueError(
            f"Audio sampling rate {sampling_rate} does not match the model's required "
            f"{processor.feature_extractor.sampling_rate} Hz. Please resample."
        )

    # Preprocess the audio; the Wav2Vec2 feature extractor returns `input_values`
    inputs = processor(
        speech_array,
        sampling_rate=processor.feature_extractor.sampling_rate,
        return_tensors="pt",
        padding=True,
    )
    input_values = inputs.input_values.to(device)
    # An attention mask is only present if the feature extractor is configured to return one
    attention_mask = inputs.get("attention_mask")
    if attention_mask is not None:
        attention_mask = attention_mask.to(device)

    # Generate transcription
    with torch.no_grad():
        predicted_ids = model.generate(
            input_values, attention_mask=attention_mask, max_length=128
        )

    # Decode the transcription
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
    return transcription[0]

# Example usage:
audio_file_path = "path/to/your/audio.wav"
try:
    transcription = transcribe_audio(audio_file_path)
    print(f"Transcription: {transcription}")
except ValueError as e:
    print(e)
except FileNotFoundError:
    print(f"Audio file not found at: {audio_file_path}. Please provide a valid path.")
```

## Reproducing Evaluation on VoxPopuli

To reproduce the evaluation results on the VoxPopuli test set:

```python
from datasets import load_dataset, Audio
from transformers import SpeechEncoderDecoderModel, AutoProcessor
import torch
from jiwer import wer
from tqdm import tqdm

model_id = "matejhornik/wav2vec2-base_bart-base_voxpopuli-en"
dataset_name = "facebook/voxpopuli"
dataset_config = "en"
split = "test"  # or "validation"

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load processor and model
processor = AutoProcessor.from_pretrained(model_id)
model = SpeechEncoderDecoderModel.from_pretrained(model_id).to(device)
model.eval()  # Set model to evaluation mode

# Load dataset
# Note: you might need to authenticate with Hugging Face if the dataset requires it
# from huggingface_hub import login
voxpopuli_test = load_dataset(dataset_name, dataset_config, split=split)

# Decode audio at the model's expected 16 kHz rate; datasets resamples on the fly
voxpopuli_test = voxpopuli_test.cast_column(
    "audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate)
)

def map_to_pred(batch):
    audio = batch["audio"]

    inputs = processor(
        audio["array"],
        sampling_rate=audio["sampling_rate"],
        return_tensors="pt",
        padding=True,
    )
    input_values = inputs.input_values.to(device)

    with torch.no_grad():
        predicted_ids = model.generate(input_values, max_length=128)

    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
    batch["prediction"] = transcription[0]
    batch["reference"] = batch["normalized_text"]
    return batch

predictions = []
references = []

for sample in tqdm(voxpopuli_test):
    processed_sample = map_to_pred(sample)
    predictions.append(processed_sample["prediction"])
    references.append(processed_sample["reference"])

# Calculate WER
current_wer = wer(references, predictions)
print(f"WER on {split} set: {current_wer:.4f}")

# Expected WER on test set: 0.0885
# Expected WER on validation set: 0.0855
```

## Framework Versions

This model was trained using:

- Python: ^3.10
- Transformers: ~4.46.3
- PyTorch: ~2.5.1
- Datasets: ^3.2.0
- PEFT: ^0.14.0
- Accelerate: ^1.4.0
- Evaluate: ^0.4.3
- WandB: ^0.19.7
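One possible way to approximate this environment with pip (the exact pins come from the thesis project's dependency file; jiwer and soundfile are added here only for the snippets above):

```bash
pip install "transformers~=4.46.3" "torch~=2.5.1" "datasets>=3.2.0,<4" \
    "peft>=0.14.0,<1" "accelerate>=1.4.0,<2" "evaluate>=0.4.3,<1" \
    "wandb>=0.19.7,<1" jiwer soundfile
```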

## Citation

If you use this model or findings from the thesis, please cite:

```bibtex
@mastersthesis{Hornik2025EffectiveTraining,
  author       = {Horník, Matej},
  title        = {Effective Training of Neural Networks for Automatic Speech Recognition},
  school       = {Brno University of Technology, Faculty of Information Technology},
  year         = {2025},
  supervisor   = {Polok, Alexander},
  type         = {Master's Thesis},
  note         = {Online. Available at: \url{https://www.vut.cz/en/students/final-thesis/detail/164401}}
}
```

## Acknowledgements

- My supervisor, Ing. Alexander Polok, for his valuable guidance and support.
- The Hugging Face team for their comprehensive transformers, datasets, and evaluate libraries.
- The creators of Wav2Vec2, BART, and the VoxPopuli dataset.

## Contact

For questions, feedback, or collaboration opportunities related to this thesis or the model, feel free to reach out: