Overview

This repository hosts a pre-trained speaker diarization model from DiariZen. Its EEND (end-to-end neural diarization) component is built on WavLM-Large and Conformer layers. The model was first pre-trained on far-field, single-channel audio from a diverse set of public datasets, including AMI, AISHELL-4, AliMeeting, NOTSOFAR-1, MSDWild, DIHARD3, RAMC, and VoxConverse. Structured pruning at 80% sparsity was then applied, and the pruned model was finally fine-tuned on MLC-SLM data.

Usage

from diarizen.pipelines.inference import DiariZenPipeline

# load pre-trained model
diar_pipeline = DiariZenPipeline.from_pretrained("BUT-FIT/diarizen-wavlm-large-s80-mlc")
# apply diarization pipeline
diar_results = diar_pipeline('audio.wav')

# print results
for turn, _, speaker in diar_results.itertracks(yield_label=True):
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")

# load pre-trained model and write RTTM results to rttm_out_dir
diar_pipeline = DiariZenPipeline.from_pretrained(
    "BUT-FIT/diarizen-wavlm-large-s80-mlc",
    rttm_out_dir='.'
)
# apply diarization pipeline
diar_results = diar_pipeline('audio.wav', sess_name='session_name')
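Once an RTTM file has been written, it can be post-processed without the pipeline itself. The following is a minimal sketch, assuming the output follows the standard RTTM layout (`SPEAKER <file> <channel> <start> <duration> <NA> <NA> <speaker> <NA> <NA>`). The helper names `parse_rttm` and `speaking_time` and the sample lines are hypothetical and are not part of DiariZen.

```python
from collections import defaultdict
from typing import NamedTuple


class Segment(NamedTuple):
    start: float    # segment onset in seconds
    end: float      # segment offset in seconds
    speaker: str    # speaker label as written in the RTTM file


def parse_rttm(text: str) -> list[Segment]:
    """Parse RTTM text into segments, skipping blank and non-SPEAKER lines."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        start = float(fields[3])
        duration = float(fields[4])
        segments.append(Segment(start, start + duration, fields[7]))
    return segments


def speaking_time(segments: list[Segment]) -> dict[str, float]:
    """Sum the duration of all segments per speaker (overlaps counted per speaker)."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)


# Hypothetical RTTM content, made up for illustration only.
example_rttm = """\
SPEAKER session_name 1 0.00 2.50 <NA> <NA> 0 <NA> <NA>
SPEAKER session_name 1 2.50 1.25 <NA> <NA> 1 <NA> <NA>
SPEAKER session_name 1 4.00 1.00 <NA> <NA> 0 <NA> <NA>
"""

segments = parse_rttm(example_rttm)
totals = speaking_time(segments)
for speaker, seconds in sorted(totals.items()):
    print(f"speaker_{speaker}: {seconds:.2f}s")
```

To process a real result, read the file from `rttm_out_dir` and pass its text to `parse_rttm`. The exact file name the pipeline writes is not stated here, so check the output directory rather than assuming one.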

Results

Diarization error rate (DER, %) of the Pyannote baseline and DiariZen, evaluated with no collar applied. Lower is better.

| Dataset | Pyannote | DiariZen |
|---|---|---|
| English-American | 20.18 | 15.88 |
| English-Australian | 13.76 | 10.82 |
| English-British | 18.85 | 12.07 |
| English-Filipino | 13.19 | 10.28 |
| English-Indian | 8.19 | 6.04 |
| French | 22.62 | 17.33 |
| German | 22.33 | 16.35 |
| Italian | 10.64 | 8.85 |
| Japanese | 26.46 | 17.81 |
| Korean | 23.25 | 16.36 |
| Portuguese | 17.60 | 14.77 |
| Russian | 11.37 | 9.99 |
| Spanish | 12.92 | 10.82 |
| Thai | 10.90 | 10.62 |
| Vietnamese | 14.64 | 12.69 |
| Average | 16.44 | 12.71 |
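
The size of the gap is easier to compare across languages as a relative DER reduction. The sketch below computes it for three rows copied from the table. The function name `relative_reduction` is introduced here for illustration, and the choice of rows is arbitrary.

```python
def relative_reduction(baseline_der: float, system_der: float) -> float:
    """Relative DER reduction of the system over the baseline, in percent."""
    return 100.0 * (baseline_der - system_der) / baseline_der


# (Pyannote DER, DiariZen DER) pairs taken from the results table.
der = {
    "Japanese": (26.46, 17.81),
    "Thai": (10.90, 10.62),
    "Average": (16.44, 12.71),
}

for name, (pyannote, diarizen) in der.items():
    gain = relative_reduction(pyannote, diarizen)
    print(f"{name}: {pyannote:.2f} -> {diarizen:.2f} ({gain:.1f}% relative reduction)")
```

On these rows, the relative reduction ranges from under 3% (Thai) to over 30% (Japanese), with the reported averages corresponding to roughly 23%.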