normundsg committed on
Commit 0426880 · verified · 1 Parent(s): 5892446

Update README.md

Files changed (1)
  1. README.md +25 -1
README.md CHANGED
@@ -3,4 +3,28 @@ license: apache-2.0
  base_model:
  - AiLab-IMCS-UL/whisper-large-v3-lv-late-cv19
  pipeline_tag: automatic-speech-recognition
- ---
+ ---
+
+ # General-purpose Latgalian ASR model
+
+ This is a fine-tuned [whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) model for [Latgalian](https://en.wikipedia.org/wiki/Latgalian_language), trained by [AiLab.lv](https://ailab.lv) using two general-purpose speech datasets:
+ - the Latgalian part of [Common Voice 20.0](https://commonvoice.mozilla.org/ltg/datasets),
+ - the Corpus of Contemporary Latgalian Speech [MuLaR](https://korpuss.lv/id/MuLaR).
+
+ ## Training
+
+ As a base model, we used a previously fine-tuned ASR model for [Latvian](https://huggingface.co/AiLab-IMCS-UL/whisper-large-v3-lv-late-cv19), and continued to fine-tune it for Latgalian. The fine-tuning was done using the Hugging Face Transformers library.
+
+ | Training data | Hours |
+ |:---|---:|
+ | Latgalian Common Voice 20.0 train set (a [VW split](https://analyzer.cv-toolbox.web.tr)) | 22.9 |
+ | Corpus of Contemporary Latgalian Speech (MuLaR) train set | 17.3 |
+ | Total | 40.2 |
+
+ ## Evaluation
+
+ TBA
+
+ ## Acknowledgements
+
+ This work was supported by the EU Recovery and Resilience Facility project [Language Technology Initiative](https://www.vti.lu.lv) (2.3.1.1.i.0/1/22/I/CFLA/002) in synergy with the State Research Programme project [LATE](https://www.digitalhumanities.lv/projekti/vpp-late/) (VPP-LETONIKA-2021/1-0006).