NeMo
rlangman committed on
Commit
65e30f0
·
verified ·
1 Parent(s): 91867a9

Create README.md

Files changed (1)
  1. README.md +151 -0
README.md ADDED
@@ -0,0 +1,151 @@
---
license: other
license_name: nsclv1
license_link: https://developer.nvidia.com/downloads/license/nsclv1
---

# NVIDIA Low Frame-rate Speech Codec
<style>
img {
display: inline-table;
vertical-align: small;
margin: 0;
padding: 0;
}
</style>
[![Model architecture](https://img.shields.io/badge/Model_Arch-HiFi--GAN-lightgrey#model-badge)](#model-architecture)
| [![Model size](https://img.shields.io/badge/Params-61.8M-lightgrey#model-badge)](#model-architecture)
| [![Language](https://img.shields.io/badge/Language-multilingual-lightgrey#model-badge)](#datasets)

The [NeMo Audio Codec](https://arxiv.org/abs/2406.05298) is a neural audio codec that compresses audio into a quantized representation. The model can be used as a vocoder for speech synthesis.


## Model Architecture
The NeMo Audio Codec model uses a symmetric convolutional encoder-decoder architecture similar to [HiFi-GAN](https://arxiv.org/abs/2010.05646).
For vector quantization, we use [Finite Scalar Quantization (FSQ)](https://arxiv.org/abs/2309.15505) with eight codebooks, four dimensions per code, and 1000 codes per codebook.

For more details, please see [our paper](https://arxiv.org/abs/2406.05298).

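As a rough illustration of what the quantizer settings above imply, the sketch below computes how many bits are needed to index one code in each codebook for a single encoded frame. The codebook count and codebook size come from this card; the frame rate is not stated here, so `frame_rate_hz` is a hypothetical caller-supplied parameter.

```python
import math

NUM_CODEBOOKS = 8          # stated in the Model Architecture section
CODES_PER_CODEBOOK = 1000  # stated in the Model Architecture section


def bits_per_frame(num_codebooks: int = NUM_CODEBOOKS,
                   codes_per_codebook: int = CODES_PER_CODEBOOK) -> float:
    """Bits needed to index one code in every codebook for one frame."""
    return num_codebooks * math.log2(codes_per_codebook)


def bitrate_bps(frame_rate_hz: float) -> float:
    """Bitrate in bits per second for an assumed frame rate.

    The frame rate is an assumption supplied by the caller; this card
    does not state it.
    """
    return frame_rate_hz * bits_per_frame()


if __name__ == "__main__":
    print(f"{bits_per_frame():.1f} bits per frame")
```

With the values listed above, each frame carries about 79.7 bits (8 × log2(1000)).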
### Input
- **Input Type:** Audio
- **Input Format(s):** .wav files
- **Input Parameters:** One-Dimensional (1D)
- **Other Properties Related to Input:** 22050 Hz Mono-channel Audio

### Output
- **Output Type:** Audio
- **Output Format:** .wav files
- **Output Parameters:** One-Dimensional (1D)
- **Other Properties Related to Output:** 22050 Hz Mono-channel Audio

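Before passing a file to the model, it can help to confirm that it matches the input properties listed above. The following is a minimal sketch using only Python's standard `wave` module, which reads PCM-encoded .wav files; the helper name `check_wav_format` is hypothetical and not part of NeMo.

```python
import wave
from typing import List

EXPECTED_SAMPLE_RATE = 22050  # Hz, from the Input section
EXPECTED_CHANNELS = 1         # mono, from the Input section


def check_wav_format(path: str) -> List[str]:
    """Return a list of mismatches between a .wav file and the expected format."""
    problems = []
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
    if sample_rate != EXPECTED_SAMPLE_RATE:
        problems.append(
            f"sample rate is {sample_rate} Hz, expected {EXPECTED_SAMPLE_RATE} Hz"
        )
    if channels != EXPECTED_CHANNELS:
        problems.append(
            f"file has {channels} channels, expected {EXPECTED_CHANNELS}"
        )
    return problems
```

An empty list means the file already matches; otherwise the audio would need to be resampled or downmixed first, for example by loading it with `librosa.load` at the target sample rate as the inference example does.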
## How to Use this Model

The model is available in [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

### Inference
For inference, you can follow our [Audio Codec Inference Tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/Audio_Codec_Inference.ipynb), which automatically downloads the model checkpoint. Note that you will need to set the `model_name` parameter to "nvidia/audio-codec-22khz".

Alternatively, you can manually download the [checkpoint](https://huggingface.co/nvidia/Low-Frame-rate-Speech-Codec/resolve/main/Low-Frame-rate-Speech-Codec.nemo) and use the code below to run inference with the model:

```
import librosa
import soundfile as sf
import torch
from nemo.collections.tts.models import AudioCodecModel

codec_path = "Low-Frame-rate-Speech-Codec.nemo"  # path to the downloaded .nemo checkpoint
path_to_input_audio = "input.wav"  # placeholder: path of the input audio
path_to_output_audio = "output.wav"  # placeholder: path of the reconstructed output audio

# Restore the checkpoint on CPU, then move the model to the same device as the inputs
device = "cuda" if torch.cuda.is_available() else "cpu"
nemo_codec_model = AudioCodecModel.restore_from(restore_path=codec_path, map_location="cpu")
nemo_codec_model = nemo_codec_model.to(device).eval()

# Load the audio, resampled to the model's sample rate
audio, _ = librosa.load(path_to_input_audio, sr=nemo_codec_model.sample_rate)
audio_tensor = torch.from_numpy(audio).unsqueeze(dim=0).to(device)
audio_len = torch.tensor([audio_tensor[0].shape[0]]).to(device)

with torch.no_grad():
    # Get discrete tokens from audio
    encoded_tokens, encoded_len = nemo_codec_model.encode(audio=audio_tensor, audio_len=audio_len)

    # Reconstruct audio from tokens
    reconstructed_audio, _ = nemo_codec_model.decode(tokens=encoded_tokens, tokens_len=encoded_len)

# Save reconstructed audio
output_audio = reconstructed_audio.cpu().numpy().squeeze()
sf.write(path_to_output_audio, output_audio, nemo_codec_model.sample_rate)
```

### Training
For fine-tuning on another dataset, please follow the steps in our [Audio Codec Training Tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/Audio_Codec_Training.ipynb). Note that you will need to set the `CONFIG_FILENAME` parameter to the "audio_codec_low_frame_rate_22050.yaml" config. You will also need to set `pretrained_model_name` to "audio_codec_low_frame_rate_22khz".

## Training, Testing, and Evaluation Datasets

The Low Frame-rate Speech Codec was trained on 28.7k hours of speech data spanning 105 languages. The model was evaluated using multilingual audiobook-style data and high-quality English recordings. For further details, refer to [our paper](https://arxiv.org/abs/2409.12117).


### Training Datasets
The 28.7k hours of training data come from two datasets:

- [MLS English](https://www.openslr.org/94/) (25.5k hours)
  - Data Collection Method: by Human
  - Labeling Method: Automated
- [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) (3.2k hours)
  - Data Collection Method: by Human
  - Labeling Method: by Human

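To make the training mix concrete, the short sketch below checks that the per-dataset hours add up to the stated total and computes each dataset's share. The hour figures are taken from the list above; the helper name `dataset_shares` is hypothetical.

```python
# Hours per training dataset, in thousands of hours, as listed above
TRAINING_HOURS_K = {"MLS English": 25.5, "Common Voice": 3.2}


def dataset_shares(hours_k: dict) -> dict:
    """Return each dataset's fraction of the total training hours."""
    total = sum(hours_k.values())
    return {name: hours / total for name, hours in hours_k.items()}


if __name__ == "__main__":
    total_k = sum(TRAINING_HOURS_K.values())
    print(f"total: {total_k:.1f}k hours")
    for name, share in dataset_shares(TRAINING_HOURS_K).items():
        print(f"{name}: {share:.1%}")
```

By hours, MLS English accounts for roughly 89% of the training data and Common Voice for roughly 11%.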
### Test Datasets

- [MLS English](https://www.openslr.org/94/)
  - Data Collection Method: by Human
  - Labeling Method: Automated
- [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0)
  - Data Collection Method: by Human
  - Labeling Method: by Human

## Software Integration

### Supported Hardware Microarchitecture Compatibility
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Jetson
- NVIDIA Hopper
- NVIDIA Lovelace
- NVIDIA Pascal
- NVIDIA Turing
- NVIDIA Volta

### Runtime Engine

- NeMo 2.0.0

### Preferred Operating System

- Linux

## License/Terms of Use
This model is for research and development only (non-commercial use), and its use is governed by the [NSCLv1](https://developer.nvidia.com/downloads/license/nsclv1) license.

## Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).