NeMo
rlangman committed on
Commit
ef3dbab
·
verified ·
1 Parent(s): 620038a

Update README.md

  1. README.md +9 -0
README.md CHANGED
@@ -98,6 +98,15 @@ The NeMo Audio Codec is trained on a total of 28.7k hrs of speech data from 105
  - [MLS English](https://www.openslr.org/94/) - 15 hours, 42 speakers, English
  - [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0) - 2 hours, 1356 speakers, 59 languages
 
+ ## Performance
+
+ We evaluate our codec using several objective audio quality metrics: [ViSQOL](https://github.com/google/visqol) and [PESQ](https://lightning.ai/docs/torchmetrics/stable/audio/perceptual_evaluation_speech_quality.html) for perceptual quality, [ESTOI](https://ieeexplore.ieee.org/document/7539284) for intelligibility, mel spectrogram and STFT distances for spectral reconstruction accuracy, and SI-SDR [7] for phase reconstruction accuracy. Metrics are reported on the test sets of the MLS English and Common Voice datasets. The model has not been trained or evaluated on non-speech audio.
+
+ | Dataset | ViSQOL | PESQ | ESTOI | Mel Distance | STFT Distance | SI-SDR |
+ |:-----------:|:----------:|:----------:|:----------:|:-----------:|:-----------:|:-----------:|
+ | MLS English | 4.50 | 3.69 | 0.94 | 0.066 | 0.033 | 8.33 |
+ | Common Voice | 4.53 | 3.55 | 0.93 | 0.100 | 0.057 | 7.63 |
+
  ## Software Integration
 
  ### Supported Hardware Microarchitecture Compatibility: