NeMo
rlangman committed
Commit 0eaf726 · verified · 1 Parent(s): c40f87f

Update README.md

Files changed (1)
  1. README.md +4 -6
README.md CHANGED
@@ -18,14 +18,12 @@ padding: 0;
 | [![Model size](https://img.shields.io/badge/Params-61.8M-lightgrey#model-badge)](#model-architecture)
 | [![Language](https://img.shields.io/badge/Language-multilingual-lightgrey#model-badge)](#datasets)
 
-The NeMo Audio Codec is a neural audio codec which compresses audio into a quantized representation. This model can be used as a vocoder for speech synthesis.
-
+The NeMo Audio Codec is a neural audio codec which compresses audio into a quantized representation. The model can be used as a vocoder for speech synthesis. The model works with full-bandwidth 22.05kHz speech. It might have lower performance with low-bandwidth speech (e.g. 16kHz speech upsampled to 22.05kHz) or with non-speech audio.
 
 ## Model Architecture
-The NeMo Audio Codec model uses symmetric convolutional encoder-decoder architecture based on [HiFi-GAN](https://arxiv.org/abs/2010.05646).
-For the vector quantization, we use [Finite Scalar Quantization (FSQ)](https://arxiv.org/abs/2309.15505), with eight codebooks, and 1000 entries per codebook.
+The NeMo Audio Codec model uses symmetric convolutional encoder-decoder architecture based on [HiFi-GAN](https://arxiv.org/abs/2010.05646). We use [Finite Scalar Quantization (FSQ)](https://arxiv.org/abs/2309.15505), with eight codebooks, 1000 entries per codebook, 86.1 frames per second, and a 6.9kbps bitrate.
 
-For more details please check [our paper](https://arxiv.org/abs/2406.05298).
+For more details please refer to [our paper](https://arxiv.org/abs/2406.05298).
 
 ### Input
 - **Input Type:** Audio
@@ -80,8 +78,8 @@ sf.write(path_to_output_audio, output_audio, nemo_codec_model.sample_rate)
 ```
 
 ### Training
-For fine-tuning on another dataset please follow the steps available at our [Audio Codec Training Tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/Audio_Codec_Training.ipynb). Note that you will need to set the ```CONFIG_FILENAME``` parameter to the "audio_codec_22050.yaml" config. You also will need to set ```pretrained_model_name``` to "audio_codec_22khz".
+For fine-tuning on another dataset please follow the steps available at our [Audio Codec Training Tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/Audio_Codec_Training.ipynb). Note that you will need to set the ```CONFIG_FILENAME``` parameter to the "audio_codec_22050.yaml" config. You also will need to set ```pretrained_model_name``` to "audio-codec-22khz".
 
 
 ## Training, Testing, and Evaluation Datasets:
 
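
The architecture line added in the first hunk pins down the discrete representation this codec produces: eight FSQ codebooks with 1000 entries each, at roughly 86.1 token frames per second (about 6.9 kbps). Below is a minimal round-trip sketch using the NeMo `AudioCodecModel` encode/decode API that the README's own usage snippet (the `sf.write(...)` context line in the second hunk) is built around; the Hugging Face id `nvidia/audio-codec-22khz` and the input filename are assumptions, not part of this commit.

```python
import librosa
import torch
from nemo.collections.tts.models import AudioCodecModel

# Assumption: the checkpoint loads by Hugging Face repo id; a local .nemo file
# would use AudioCodecModel.restore_from(...) instead.
codec = AudioCodecModel.from_pretrained("nvidia/audio-codec-22khz").eval()

# Load input at the codec's native sample rate (22.05 kHz for this model).
# "speech.wav" is a placeholder path.
audio, _ = librosa.load("speech.wav", sr=codec.sample_rate)
audio = torch.from_numpy(audio).unsqueeze(0)   # (batch=1, num_samples)
audio_len = torch.tensor([audio.shape[1]])

with torch.no_grad():
    # Encode to discrete tokens: one index stream per FSQ codebook,
    # i.e. roughly (batch, 8 codebooks, 86.1 * seconds) frames for this model.
    tokens, tokens_len = codec.encode(audio=audio, audio_len=audio_len)

    # Decode the tokens back to a waveform, i.e. use the decoder as a vocoder.
    reconstructed = codec.decode(tokens=tokens, tokens_len=tokens_len)
```

The README's full snippet then moves the reconstruction to NumPy and writes it out at the model's sample rate, matching the `sf.write(...)` context line above.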
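
For the training note updated in the second hunk, the two settings it names are variables in the linked tutorial notebook. A small sketch of how they would be set, with the surrounding invocation shown only as an assumed illustration (the script path and Hydra override key are not taken from the tutorial):

```python
# The two tutorial parameters named in the training note above.
CONFIG_FILENAME = "audio_codec_22050.yaml"    # config matching this 22.05 kHz model
pretrained_model_name = "audio-codec-22khz"   # checkpoint to fine-tune from (hyphenated name fixed in this commit)

# One way such values are commonly passed to a NeMo training run via Hydra overrides.
# The script path and override below are assumptions for illustration, not the
# tutorial's exact cells:
#   python examples/tts/audio_codec.py \
#       --config-name={CONFIG_FILENAME} \
#       +init_from_pretrained_model={pretrained_model_name} \
#       ...
```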