---
license: apache-2.0
base_model:
- AiLab-IMCS-UL/whisper-large-v3-lv-late-cv19
pipeline_tag: automatic-speech-recognition
---

# General-purpose Latgalian ASR model

This is a fine-tuned [whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) model for [Latgalian](https://en.wikipedia.org/wiki/Latgalian_language), trained by [AiLab.lv](https://ailab.lv) using two general-purpose speech datasets:

- the Latgalian part of [Common Voice 20.0](https://commonvoice.mozilla.org/ltg/datasets),
- the Corpus of Contemporary Latgalian Speech [MuLaR](https://korpuss.lv/id/MuLaR).
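A minimal transcription sketch, assuming the model is loaded through the Transformers `pipeline` API. The repository id of this model is not stated in this card, so it is passed in as a parameter. The helper names `build_asr` and `transcribe` and the 30-second chunk length are illustrative assumptions, not the authors' setup.

```python
from typing import Any, Callable


def build_asr(model_id: str, chunk_length_s: float = 30.0) -> Callable[..., Any]:
    """Create a speech recognition pipeline for the given Hub repository id.

    `model_id` is a parameter because this card does not state the repo id.
    """
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import pipeline

    return pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=chunk_length_s,  # assumption: chunk long recordings
    )


def transcribe(asr: Callable[..., Any], audio_path: str) -> str:
    """Run the pipeline on one audio file and return the recognized text."""
    result = asr(audio_path)
    return result["text"].strip()
```

For example, `transcribe(build_asr(model_id), "recording.wav")` would return the transcript of a local file, where `model_id` is this model's Hub id and `recording.wav` is a hypothetical path.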

## Training

As the base model, we used a previously fine-tuned ASR model for [Latvian](https://huggingface.co/AiLab-IMCS-UL/whisper-large-v3-lv-late-cv19) and continued fine-tuning it for Latgalian. Fine-tuning was done with the Hugging Face Transformers library.

| Training data | Hours |
|:---|---:|
| Latgalian Common Voice 20.0 train set (a [VW split](https://analyzer.cv-toolbox.web.tr)) | 22.9 |
| Corpus of Contemporary Latgalian Speech (MuLaR) train set | 17.3 |
| Total | 40.2 |
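The starting point for continued fine-tuning can be sketched as loading the Latvian checkpoint named above. This only shows how such a run might begin. The training hyperparameters and data pipeline are not given in this card, and the helper name `load_base_checkpoint` is hypothetical.

```python
# Base checkpoint id taken from this card's metadata.
BASE_MODEL_ID = "AiLab-IMCS-UL/whisper-large-v3-lv-late-cv19"


def load_base_checkpoint(model_id: str = BASE_MODEL_ID):
    """Load the processor and model weights to continue fine-tuning from."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(model_id)
    model = WhisperForConditionalGeneration.from_pretrained(model_id)
    return processor, model
```

The returned `processor` and `model` could then be passed to a training loop of your choice.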

## Evaluation

| Testing data | WER (%) |
|:---|---:|
| Latgalian CV 20.0 test set (1.5 hours) | 9.1 |
| MuLaR test set (1.6 hours) | 25.7 |

NB! The MuLaR corpus contains transcriptions that generally do not follow the rules of the standard Latgalian orthography, in contrast to the Latgalian CV corpus.
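For readers who want to reproduce this kind of measurement, here is a minimal sketch of corpus-level word error rate: total word-level edit distance divided by the total number of reference words. It is not the authors' evaluation script. The text normalization they applied (casing, punctuation) is not stated here, so the sketch compares whitespace-separated tokens exactly as given. The function names are illustrative.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the reference prefix processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def corpus_wer(references: List[str], hypotheses: List[str]) -> float:
    """Return WER in percent over paired reference/hypothesis transcripts."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_edits = 0
    total_words = 0
    for reference, hypothesis in zip(references, hypotheses):
        ref_words = reference.split()
        total_edits += edit_distance(ref_words, hypothesis.split())
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return 100.0 * total_edits / total_words
```

Because the score depends on exact token matches, transcripts that use a non-standard spelling are counted as errors whenever the model outputs the standard form.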

## Acknowledgements

This work was supported by the EU Recovery and Resilience Facility project [Language Technology Initiative](https://www.vti.lu.lv) (2.3.1.1.i.0/1/22/I/CFLA/002) in synergy with the State Research Programme project "Diversity of Latvian in Time and Space" (VPP-LETONIKA-2021/4-0003).