lshzhm committed on
Commit cf56b46 · verified · 1 Parent(s): 99bbd30

Update README.md

Files changed (1)
  1. README.md +77 -74
README.md CHANGED
@@ -1,74 +1,77 @@
- <div align="center">
- <p align="center">
- <h2>DeepAudio-V1</h2>
- <a href="https://arxiv.org/">Paper</a> | <a href="https://pages.github.com/">Webpage</a> | <a href="https://huggingface.co/">Models</a>
- </p>
- </div>
-
-
- ## [DeepAudio-V1: Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation](https://pages.github.com/)
-
-
- ## Installation
-
- **1. Create a conda environment**
-
- ```bash
- conda create -n v2as python=3.10
- conda activate v2as
- ```
-
- **2. F5-TTS base install**
-
- ```bash
- cd ./F5-TTS
- pip install -e .
- ```
-
- **3. Additional requirements**
-
- ```bash
- pip install -r requirements.txt
- conda install cudnn
- ```
-
- **Pretrained models**
-
- The models are available at https://huggingface.co/. See [MODELS.md](./MODELS.md) for more details.
-
- ## Inference
-
- **1. V2A inference**
-
- ```bash
- bash v2a.sh
- ```
-
- **2. V2S inference**
-
- ```bash
- bash v2s.sh
- ```
-
- **3. TTS inference**
-
- ```bash
- bash tts.sh
- ```
-
- ## Evaluation
-
- ```bash
- bash eval_v2c.sh
- ```
-
-
- ## Acknowledgement
-
- - [MMAudio](https://github.com/hkchengrex/MMAudio) for the video-to-audio backbone and pretrained models.
- - [F5-TTS](https://github.com/SWivid/F5-TTS) for the text-to-speech and video-to-speech backbone.
- - [V2C](https://github.com/chenqi008/V2C) for the animated movie benchmark.
- - [Wav2Vec2-Emotion](https://huggingface.co/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim) for emotion recognition in the EMO-SIM evaluation.
- - [WavLM-SV](https://huggingface.co/microsoft/wavlm-base-sv) for speaker verification in the SPK-SIM evaluation.
- - [Whisper](https://huggingface.co/Systran/faster-whisper-large-v3) for speech recognition in the WER evaluation.
-
+ ---
+ title: DeepAudio-V1 — multi-modal speech and audio generation
+ emoji: 🔊
+ colorFrom: blue
+ colorTo: indigo
+ sdk: gradio
+ app_file: app.py
+ pinned: false
+ ---
+
+
+ ## [DeepAudio-V1: Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation](https://pages.github.com/)
+
+
+ ## Installation
+
+ **1. Create a conda environment**
+
+ ```bash
+ conda create -n v2as python=3.10
+ conda activate v2as
+ ```
+
+ **2. F5-TTS base install**
+
+ ```bash
+ cd ./F5-TTS
+ pip install -e .
+ ```
+
+ **3. Additional requirements**
+
+ ```bash
+ pip install -r requirements.txt
+ conda install cudnn
+ ```
+
+ **Pretrained models**
+
+ The models are available at https://huggingface.co/. See [MODELS.md](./MODELS.md) for more details.
+
+ ## Inference
+
+ **1. V2A inference**
+
+ ```bash
+ bash v2a.sh
+ ```
+
+ **2. V2S inference**
+
+ ```bash
+ bash v2s.sh
+ ```
+
+ **3. TTS inference**
+
+ ```bash
+ bash tts.sh
+ ```
+
+ ## Evaluation
+
+ ```bash
+ bash eval_v2c.sh
+ ```
+
+
+ ## Acknowledgement
+
+ - [MMAudio](https://github.com/hkchengrex/MMAudio) for the video-to-audio backbone and pretrained models.
+ - [F5-TTS](https://github.com/SWivid/F5-TTS) for the text-to-speech and video-to-speech backbone.
+ - [V2C](https://github.com/chenqi008/V2C) for the animated movie benchmark.
+ - [Wav2Vec2-Emotion](https://huggingface.co/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim) for emotion recognition in the EMO-SIM evaluation.
+ - [WavLM-SV](https://huggingface.co/microsoft/wavlm-base-sv) for speaker verification in the SPK-SIM evaluation.
+ - [Whisper](https://huggingface.co/Systran/faster-whisper-large-v3) for speech recognition in the WER evaluation.
+
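
The "Pretrained models" step in the README above points to https://huggingface.co/ without naming a repository. As a minimal sketch, assuming the checkpoints are published as a Hugging Face model repo (the repo id below is a placeholder, not confirmed by this commit; MODELS.md lists the actual locations), they could be fetched with the huggingface_hub CLI:

```bash
# Minimal sketch, not part of this commit: download checkpoints from the Hub.
# "lshzhm/DeepAudio-V1" is a hypothetical repo id -- substitute the ids
# given in MODELS.md.
pip install "huggingface_hub[cli]"
huggingface-cli download lshzhm/DeepAudio-V1 --local-dir ./ckpts
```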