Update README.md
README.md
CHANGED
@@ -3,4 +3,28 @@ license: apache-2.0
base_model:
- AiLab-IMCS-UL/whisper-large-v3-lv-late-cv19
pipeline_tag: automatic-speech-recognition
---

# General-purpose Latgalian ASR model

This is a fine-tuned [whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) model for [Latgalian](https://en.wikipedia.org/wiki/Latgalian_language), trained by [AiLab.lv](https://ailab.lv) using two general-purpose speech datasets:
- the Latgalian part of [Common Voice 20.0](https://commonvoice.mozilla.org/ltg/datasets),
- the Corpus of Contemporary Latgalian Speech [MuLaR](https://korpuss.lv/id/MuLaR).
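
## Usage

A minimal inference sketch, assuming the model is loaded through the Transformers `pipeline` API under the `automatic-speech-recognition` task declared in the metadata. The repository ID of this model is not stated here, so `model_id` is a placeholder to fill in. The helper names `build_pipeline_kwargs` and `transcribe` and the 30-second chunk length are illustrative assumptions, not part of the model card.

```python
from typing import Any

# Matches the pipeline_tag in the model card metadata.
TASK = "automatic-speech-recognition"


def build_pipeline_kwargs(model_id: str, chunk_length_s: float = 30.0) -> dict[str, Any]:
    """Collect keyword arguments for transformers.pipeline().

    model_id is a Hub repository ID supplied by the caller (placeholder here).
    chunk_length_s is an assumed default for splitting long recordings.
    """
    if not model_id:
        raise ValueError("model_id must be a non-empty Hub repository ID")
    return {"task": TASK, "model": model_id, "chunk_length_s": chunk_length_s}


def transcribe(audio_path: str, model_id: str) -> str:
    """Transcribe one audio file and return the recognized text."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    asr = pipeline(**build_pipeline_kwargs(model_id))
    return asr(audio_path)["text"]
```

Calling `transcribe("sample.wav", "<namespace>/<model-name>")` with the actual repository ID would return the transcript as a string. Reading audio from a file path requires `ffmpeg` to be available on the system.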

## Training

As a base model, we used a previously fine-tuned ASR model for [Latvian](https://huggingface.co/AiLab-IMCS-UL/whisper-large-v3-lv-late-cv19), and continued to fine-tune it for Latgalian. The fine-tuning was done using the Hugging Face Transformers library.
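
The starting point described above can be sketched as follows, assuming the Latvian checkpoint is loaded with the Whisper classes from Transformers and that its repository ships the processor files. Only the repository ID comes from the model card. The function name `load_base_checkpoint` is hypothetical, and the actual training script, data pipeline, and hyperparameters are not documented here.

```python
# Repository ID taken from the base_model field of the model card metadata.
BASE_MODEL_ID = "AiLab-IMCS-UL/whisper-large-v3-lv-late-cv19"


def load_base_checkpoint(model_id: str = BASE_MODEL_ID):
    """Load the Latvian checkpoint and its processor as a starting point for fine-tuning."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(model_id)
    model = WhisperForConditionalGeneration.from_pretrained(model_id)
    return model, processor
```

The returned model and processor would then be passed to whatever training loop is used for the Latgalian data.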

| Training data | Hours |
|:---|---:|
| Latgalian Common Voice 20.0 train set (a [VW split](https://analyzer.cv-toolbox.web.tr)) | 22.9 |
| Corpus of Contemporary Latgalian Speech (MuLaR) train set | 17.3 |
| Total | 40.2 |

## Evaluation

TBA

## Acknowledgements

This work was supported by the EU Recovery and Resilience Facility project [Language Technology Initiative](https://www.vti.lu.lv) (2.3.1.1.i.0/1/22/I/CFLA/002) in synergy with the State Research Programme project [LATE](https://www.digitalhumanities.lv/projekti/vpp-late/) (VPP-LETONIKA-2021/1-0006).