Update README.md
README.md
CHANGED
@@ -18,14 +18,12 @@ padding: 0;
 | [](#model-architecture)
 | [](#datasets)
 
-The NeMo Audio Codec is a neural audio codec which compresses audio into a quantized representation.
-
 
 ## Model Architecture
-The NeMo Audio Codec model uses symmetric convolutional encoder-decoder architecture based on [HiFi-GAN](https://arxiv.org/abs/2010.05646).
-For the vector quantization, we use [Finite Scalar Quantization (FSQ)](https://arxiv.org/abs/2309.15505), with eight codebooks, and 1000 entries per codebook.
 
-For more details please
 
 ### Input
 - **Input Type:** Audio
@@ -80,8 +78,8 @@ sf.write(path_to_output_audio, output_audio, nemo_codec_model.sample_rate)
 ```
 
 ### Training
-For fine-tuning on another dataset please follow the steps available at our [Audio Codec Training Tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/Audio_Codec_Training.ipynb). Note that you will need to set the ```CONFIG_FILENAME``` parameter to the "audio_codec_22050.yaml" config. You also will need to set ```pretrained_model_name``` to "audio_codec_22khz".
 
 
 ## Training, Testing, and Evaluation Datasets:
 
@@ -18,14 +18,12 @@ padding: 0;
 | [](#model-architecture)
 | [](#datasets)
 
+The NeMo Audio Codec is a neural audio codec which compresses audio into a quantized representation. The model can be used as a vocoder for speech synthesis. The model works with full-bandwidth 22.05kHz speech. It might have lower performance with low-bandwidth speech (e.g. 16kHz speech upsampled to 22.05kHz) or with non-speech audio.
 
 ## Model Architecture
+The NeMo Audio Codec model uses symmetric convolutional encoder-decoder architecture based on [HiFi-GAN](https://arxiv.org/abs/2010.05646). We use [Finite Scalar Quantization (FSQ)](https://arxiv.org/abs/2309.15505), with eight codebooks, 1000 entries per codebook, 86.1 frames per second, and a 6.9kbps bitrate.
 
+For more details please refer to [our paper](https://arxiv.org/abs/2406.05298).
 
 ### Input
 - **Input Type:** Audio
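The figures quoted in the added Model Architecture line (eight codebooks, 1000 entries per codebook, 86.1 frames per second, 6.9kbps) can be cross-checked with a little arithmetic. The sketch below assumes each codebook index costs `log2(1000)` bits with no further entropy coding; that assumption and all variable names are illustrative and are not taken from the README or from NeMo.

```python
import math

# Figures quoted in the README text.
SAMPLE_RATE_HZ = 22050        # "full-bandwidth 22.05kHz speech"
NUM_CODEBOOKS = 8             # "eight codebooks"
ENTRIES_PER_CODEBOOK = 1000   # "1000 entries per codebook"
FRAMES_PER_SECOND = 86.1      # "86.1 frames per second"

# Assumption: one codebook index carries log2(entries) bits, uncompressed.
bits_per_codebook = math.log2(ENTRIES_PER_CODEBOOK)
bits_per_frame = NUM_CODEBOOKS * bits_per_codebook
bitrate_kbps = bits_per_frame * FRAMES_PER_SECOND / 1000

# Number of audio samples covered by one frame, implied by the two rates.
samples_per_frame = SAMPLE_RATE_HZ / FRAMES_PER_SECOND

print(f"bits per frame:    {bits_per_frame:.2f}")
print(f"bitrate:           {bitrate_kbps:.2f} kbps")
print(f"samples per frame: {samples_per_frame:.1f}")
```

Under that assumption the result is about 6.86 kbps, which rounds to the 6.9kbps stated in the README, and each frame covers roughly 256 input samples.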
@@ -80,8 +78,8 @@ sf.write(path_to_output_audio, output_audio, nemo_codec_model.sample_rate)
 ```
 
 ### Training
 
+For fine-tuning on another dataset please follow the steps available at our [Audio Codec Training Tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/Audio_Codec_Training.ipynb). Note that you will need to set the ```CONFIG_FILENAME``` parameter to the "audio_codec_22050.yaml" config. You also will need to set ```pretrained_model_name``` to "audio-codec-22khz".
 
 ## Training, Testing, and Evaluation Datasets:
 
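The two settings named in the Training paragraph could be written down as follows. This is a minimal sketch: the two values come from the README text, while the `check_model_name` helper and the `settings` dictionary are hypothetical and are not part of NeMo or the tutorial notebook.

```python
# Values taken from the README text. This commit changes the model name
# from "audio_codec_22khz" (underscores) to "audio-codec-22khz" (hyphens).
CONFIG_FILENAME = "audio_codec_22050.yaml"
pretrained_model_name = "audio-codec-22khz"


def check_model_name(name: str) -> None:
    """Hypothetical helper: reject the underscore spelling the commit replaced."""
    if "_" in name:
        raise ValueError(f"expected hyphens in model name, got {name!r}")


check_model_name(pretrained_model_name)

# Hypothetical container for the values to set in the tutorial.
settings = {
    "CONFIG_FILENAME": CONFIG_FILENAME,
    "pretrained_model_name": pretrained_model_name,
}
print(settings)
```

Note that the config filename keeps its underscores; only the pretrained model name switches to hyphens.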