AiLab-IMCS-UL
/

whisper-large-v3-latgalian-2503

	---
	license: apache-2.0
	base_model:
	- AiLab-IMCS-UL/whisper-large-v3-lv-late-cv19
	pipeline_tag: automatic-speech-recognition
	---

	# General-purpose Latgalian ASR model

	This is a fine-tuned [whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) model for [Latgalian](https://en.wikipedia.org/wiki/Latgalian_language), trained by [AiLab.lv](https://ailab.lv) using two general-purpose speech datasets:
	- the Latgalian part of [Common Voice 20.0](https://commonvoice.mozilla.org/ltg/datasets),
	- the Corpus of Contemporary Latgalian Speech [MuLaR](https://korpuss.lv/id/MuLaR).
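
The model can presumably be loaded like any other Whisper checkpoint on the Hub. The following is a minimal sketch, not an official usage recipe. It assumes the repository ID shown at the top of this page, the standard Transformers `pipeline` API, and default decoding settings. The helper names `build_transcriber` and `transcribe` are hypothetical and exist only for illustration.

```python
# Repository ID as shown at the top of this page.
MODEL_ID = "AiLab-IMCS-UL/whisper-large-v3-latgalian-2503"


def build_transcriber(model_id: str = MODEL_ID, device: str = "cpu"):
    """Create a Transformers ASR pipeline for the given checkpoint (hypothetical helper)."""
    # Imported here so that defining the helper does not load the heavy dependency.
    from transformers import pipeline

    return pipeline(
        "automatic-speech-recognition",
        model=model_id,
        device=device,       # e.g. "cuda:0" when a GPU is available
        chunk_length_s=30,   # process long recordings in 30-second windows
    )


def transcribe(audio_path: str, transcriber=None) -> str:
    """Return the recognised text for one audio file (decoding a file path needs ffmpeg)."""
    asr = transcriber if transcriber is not None else build_transcriber()
    return asr(audio_path)["text"]
```

Calling `transcribe("sample.wav")`, where `sample.wav` stands for any local recording, downloads the checkpoint on first use and returns the recognised text. When transcribing many files, build the pipeline once with `build_transcriber()` and pass it in as `transcriber`.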

	## Training

	As the base model, we used an ASR model previously fine-tuned for [Latvian](https://huggingface.co/AiLab-IMCS-UL/whisper-large-v3-lv-late-cv19) and continued fine-tuning it on the Latgalian data. The fine-tuning was done using the Hugging Face Transformers library.

	| Training data | Hours |
	|:---|---:|
	| Latgalian Common Voice 20.0 train set (a [VW split](https://analyzer.cv-toolbox.web.tr)) | 22.9 |
	| Corpus of Contemporary Latgalian Speech (MuLaR) train set | 17.3 |
	| Total | 40.2 |

	## Evaluation

	| Testing data | WER (%) |
	|:---|---:|
	| Latgalian CV 20.0 test set (1.5 hours) | 9.1 |
	| MuLaR test set (1.6 hours) | 25.7 |

	NB! Unlike the Latgalian CV corpus, the MuLaR corpus contains transcriptions that generally do not follow the rules of the standard Latgalian orthography, which likely contributes to the higher WER on the MuLaR test set.
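
To make the metric concrete, the sketch below computes a corpus-level word error rate as the word-level edit distance divided by the number of reference words. It is only an illustration of the standard definition. The text normalisation and scoring tool actually used for the table above are not stated here, so this code is an assumption rather than the evaluation script. The function names and the English placeholder sentences are hypothetical.

```python
def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(pairs: list[tuple[str, str]]) -> float:
    """WER in percent over (reference, hypothesis) pairs, pooled across utterances."""
    total_errors = 0
    total_words = 0
    for reference, hypothesis in pairs:
        errors, words = word_errors(reference, hypothesis)
        total_errors += errors
        total_words += words
    if total_words == 0:
        raise ValueError("references contain no words")
    return 100.0 * total_errors / total_words


# A single spelling variant counts as a full word error (placeholder English text).
print(corpus_wer([("the colour is nice", "the color is nice")]))  # 25.0
```

Because WER compares exact word forms, a hypothesis written in standard orthography is penalised whenever the reference transcription spells the same word differently, even if the recognition itself is acoustically correct.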

	## Acknowledgements

	This work was supported by the EU Recovery and Resilience Facility project [Language Technology Initiative](https://www.vti.lu.lv) (2.3.1.1.i.0/1/22/I/CFLA/002) in synergy with the State Research Programme project "Diversity of Latvian in Time and Space" (VPP-LETONIKA-2021/4-0003).