Commit 8f38fc7 (parent 216736f): Update README.md
This model was trained on the Common Voice V8 dataset, which extends the Common Voice V7 data used in [airesearch/wav2vec2-large-xlsr-53-th](https://huggingface.co/airesearch/wav2vec2-large-xlsr-53-th). It was fine-tuned from [wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53).

GitHub: [https://github.com/wannaphong/thai-wav2vec2-cv-v8](https://github.com/wannaphong/thai-wav2vec2-cv-v8)

## Datasets

The dataset adds the new data from Common Voice V8 on top of Common Voice V7: all Common Voice V7 data is removed before Common Voice V8 is split, and the Common Voice V7 data is then added back to the dataset.
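The re-split step above can be sketched as set operations over utterance IDs. This is an illustrative sketch, not the actual split script: the function name `resplit` and the `split_fn` helper are assumptions, and the real script may assign the returned V7 data differently.

```python
def resplit(cv8_ids, cv7_train, cv7_dev, cv7_test, split_fn):
    """Split only the utterances that are new in CV V8, then add the
    CV V7 splits back, so the original V7 test set stays intact."""
    cv7_all = cv7_train | cv7_dev | cv7_test
    new_ids = cv8_ids - cv7_all  # data added between V7 and V8
    new_train, new_dev, new_test = split_fn(sorted(new_ids))
    return (cv7_train | set(new_train),
            cv7_dev | set(new_dev),
            cv7_test | set(new_test))
```

Keeping the V7 portion in its original splits is what makes results on the V7 test set comparable with earlier models.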

The [ekapolc/Thai_commonvoice_split](https://github.com/ekapolc/Thai_commonvoice_split) script is used to split the Common Voice dataset.

You can read more at [wannaphong/thai_commonvoice_dataset](https://github.com/wannaphong/thai_commonvoice_dataset).

## Models

This model was fine-tuned from the [wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) model on the Thai Common Voice V8 dataset, with the text pre-tokenized using `pythainlp.tokenize.word_tokenize`.

I used much of the code from [vistec-AI/wav2vec2-large-xlsr-53-th](https://github.com/vistec-AI/wav2vec2-large-xlsr-53-th).

**Test with CommonVoice V8 Testset**

| Model | WER by newmm (%) | WER by deepcut (%) | CER |
|-----------------------|------------------|--------------------|----------|
| AIResearch.in.th and PyThaiNLP | 17.414503 | 11.923089 | 3.854153 |
| wav2vec2 with deepcut | 16.354521 | 11.424476 | 3.684060 |
| wav2vec2 with newmm | 16.698299 | 11.436941 | 3.737407 |
| wav2vec2 with deepcut + language model | 12.630260 | 9.613886 | 3.292073 |
| **wav2vec2 with newmm + language model** | 12.583706 | 9.598305 | 3.276610 |

**Test with CommonVoice V7 Testset (same test by CV V7)**

| Model | WER by newmm (%) | WER by deepcut (%) | CER |
|-----------------------|------------------|--------------------|----------|
| AIResearch.in.th and PyThaiNLP | 13.936698 | 9.347462 | 2.804787 |
| wav2vec2 with deepcut | 12.776381 | 8.773006 | 2.628882 |
| wav2vec2 with newmm | 12.750596 | 8.672616 | 2.623341 |
| wav2vec2 with deepcut + language model | 9.940050 | 7.423313 | 2.344940 |
| **wav2vec2 with newmm + language model** | 9.559724 | 7.339654 | 2.277071 |

These results use the same test set as [https://huggingface.co/airesearch/wav2vec2-large-xlsr-53-th](https://huggingface.co/airesearch/wav2vec2-large-xlsr-53-th).

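The per-tokenizer WER columns come from segmenting both reference and hypothesis with the same word tokenizer before scoring. A minimal self-contained sketch of word-level WER and character-level CER via Levenshtein distance (no external metric library; the tokenized word lists passed to `wer` would come from newmm or deepcut segmentation):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, O(len(hyp)) memory."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # substitution cost is 0 when the symbols match
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]


def wer(ref_words, hyp_words):
    """Word error rate over pre-tokenized word lists."""
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def cer(ref, hyp):
    """Character error rate over raw strings."""
    return edit_distance(list(ref), list(hyp)) / len(ref)
```

Multiply by 100 to match the percentage figures in the tables above.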
## Files

- `0-download-unzip.ipynb` - notebook for downloading and unzipping Common Voice V8
- `1-convert-mp3-wav.ipynb` - notebook for converting MP3 files to WAV files
- `1-preprocessing-thai-cv-v8-wav2vev2.ipynb` - notebook for preprocessing Common Voice V8 (old file)
- `2-gen-val-json.py` - Python script for generating a manifest for NVIDIA NeMo ASR
- `2-preprocessing-thai-cv-v8-wav2vev2.ipynb` - notebook for preprocessing Common Voice V8
- `4-gen-manifest.ipynb` - notebook for generating a manifest for NVIDIA NeMo ASR
- `build-lm.ipynb` - notebook for building the ASR language model
- `test-ai4thai.ipynb` - notebook for testing AI For Thai
- `test-google.ipynb` - notebook for testing Google ASR
- `test-v7.ipynb` - notebook for testing the [vistec-AI/wav2vec2-large-xlsr-53-th](https://github.com/vistec-AI/wav2vec2-large-xlsr-53-th) model
- `test-wav2vec2-lm.ipynb` - notebook for testing our model with the language model
- `test-wav2vec2.ipynb` - notebook for testing our model without the language model
- `train-wav2vec2.py` - Python script for training the model

**Links:**

- GitHub Dataset: [https://github.com/wannaphong/thai_commonvoice_dataset](https://github.com/wannaphong/thai_commonvoice_dataset)