nvidia/audio-codec-22khz


rlangman committed on Dec 4, 2024

Commit ef3dbab · verified · 1 Parent(s): 620038a

Update README.md

Files changed (1)

README.md +9 -0


@@ -98,6 +98,15 @@ The NeMo Audio Codec is trained on a total of 28.7k hrs of speech data from 105
   - [MLS English](https://www.openslr.org/94/) - 15 hours, 42 speakers, English
   - [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0) - 2 hours, 1356 speakers, 59 languages
+## Performance
+We evaluate our codec using several objective audio quality metrics: [ViSQOL](https://github.com/google/visqol) and [PESQ](https://lightning.ai/docs/torchmetrics/stable/audio/perceptual_evaluation_speech_quality.html) for perceptual quality, [ESTOI](https://ieeexplore.ieee.org/document/7539284) for intelligibility, mel spectrogram and STFT distances for spectral reconstruction accuracy, and SI-SDR [7] for phase reconstruction accuracy. Metrics are reported on the test sets of both the MLS English and CommonVoice data. The model has not been trained or evaluated on non-speech audio.
+| Dataset     | ViSQOL | PESQ | ESTOI | Mel Distance | STFT Distance | SI-SDR |
+|:-----------:|:------:|:----:|:-----:|:------------:|:-------------:|:------:|
+| MLS English | 4.50   | 3.69 | 0.94  | 0.066        | 0.033         | 8.33   |
+| CommonVoice | 4.53   | 3.55 | 0.93  | 0.100        | 0.057         | 7.63   |
 ## Software Integration
 ### Supported Hardware Microarchitecture Compatibility:
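
Note on the SI-SDR column added in this commit: the README does not show how the metric was computed, so the following is only a minimal sketch of the commonly used scale-invariant SDR definition, in plain Python. The function name `si_sdr`, the mean-removal step, and the `eps` stabiliser are assumptions made for illustration and may differ from the evaluation code behind the reported numbers.

```python
import math


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (illustrative sketch).

    `estimate` and `reference` are equal-length sequences of samples.
    Mean removal and `eps` are assumptions, not taken from the README.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    n = len(reference)

    # Remove the mean so a constant offset does not affect the score.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [e - est_mean for e in estimate]
    ref = [r - ref_mean for r in reference]

    # Project the estimate onto the reference to get the best scaling factor.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    alpha = dot / (ref_energy + eps)

    # Split the estimate into a scaled-reference part and a residual.
    target = [alpha * r for r in ref]
    residual = [e - t for e, t in zip(est, target)]

    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


# Toy signals: the distortion is orthogonal to the reference and has 1% of its energy.
reference = [1.0, -1.0, 1.0, -1.0]
distortion = [0.1, 0.1, -0.1, -0.1]
estimate = [r + d for r, d in zip(reference, distortion)]
score = si_sdr(estimate, reference)
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the decoded audio leaves the score unchanged, so the metric reflects waveform alignment rather than loudness. Higher values indicate a closer match to the reference waveform.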