---
license: apache-2.0
language:
- th
- en
base_model:
- openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
library_name: transformers
metrics:
- wer
---

# Pathumma Whisper Large V3 (Th)

## Model Description
Additional information is needed

## Quickstart
You can transcribe audio files using the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) class with the following code snippet:
```python
import torch
from transformers import pipeline

# Use a GPU with bfloat16 if available; otherwise fall back to CPU with float32.
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32

lang = "th"
task = "transcribe"

pipe = pipeline(
    task="automatic-speech-recognition",
    model="nectec/Pathumma-whisper-th-large-v3",
    torch_dtype=torch_dtype,
    device=device,
)
# Force Thai transcription so the model does not auto-detect the language.
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language=lang, task=task)

# Replace "audio_path.wav" with the path to your audio file.
text = pipe("audio_path.wav")["text"]
print(text)
```
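
For recordings longer than Whisper's 30-second receptive window, the same pipeline can chunk the audio and optionally return segment timestamps. The snippet below is a minimal sketch using the pipeline's standard `chunk_length_s`, `batch_size`, and `return_timestamps` arguments; the file name `long_audio.wav` is a placeholder, and `pipe` is the object created in the Quickstart above:
```python
# Sketch: long-form transcription by chunking.
# Assumes the `pipe` object from the Quickstart; "long_audio.wav" is hypothetical.
result = pipe(
    "long_audio.wav",
    chunk_length_s=30,       # split the audio into 30-second chunks
    batch_size=8,            # decode several chunks in parallel
    return_timestamps=True,  # include segment-level timestamps
)
print(result["text"])            # full transcript
for chunk in result["chunks"]:   # per-segment text with (start, end) times
    print(chunk["timestamp"], chunk["text"])
```
In recent versions of `transformers` you can also pass `generate_kwargs={"language": "th", "task": "transcribe"}` in the pipeline call instead of setting `forced_decoder_ids` on the model config.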

## Evaluation Performance
WER is calculated with the newmm tokenizer for Thai word segmentation.
| Model                                   |       CV18 (WER)       |        Gowejee (WER)      |     LOTUS-TRD (WER)    |      Thai Dialect (WER)    |        Elderly (WER)       |     Gigaspeech2 (WER)      |       Fleurs (WER)         |     Distant Meeting (WER)  |       Podcast (WER)        |
|:----------------------------------------|:----------------------:|:-------------------------:|:----------------------:|:--------------------------:|:--------------------------:|:--------------------------:|:--------------------------:|:--------------------------:|:--------------------------:|
| whisper-large-v3                        |         18.75          |          46.59            |         48.14          |           57.82            |           12.27            |           33.26            |           24.08            |           72.57            |           41.24            |
| airesearch-wav2vec2-large-xlsr-53-th    |         8.49           |          17.28            |         63.01          |           48.53            |           11.29            |           52.72            |           37.32            |           85.11            |           65.12            |
| thonburian-whisper-th-large-v3-combined |         7.62           |          22.06            |         41.95          |           26.53            |           1.63             |           25.22            |           13.90            |           64.68            |           32.42            |
| monsoon-whisper-medium-gigaspeech2      |         11.66          |          20.50            |         41.04          |           42.06            |           7.57             |           21.40            |           21.54            |           51.65            |           38.89            |
| pathumma-whisper-th-large-v3            |         8.68           |           9.84            |         15.47          |           19.85            |           1.53             |           21.66            |           15.65            |           51.56            |           36.47            |

**Note:** Results for models that were not fine-tuned on dialect datasets may be less representative of their dialect performance.
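
As the note above states, WER is computed after segmenting Thai text with the newmm tokenizer, since Thai is written without spaces between words. The following is a minimal sketch of that metric, assuming PyThaiNLP for segmentation and the Hugging Face `evaluate` implementation of WER; the example strings are purely illustrative, not drawn from the evaluation sets:
```python
import evaluate
from pythainlp.tokenize import word_tokenize

wer_metric = evaluate.load("wer")

def thai_wer(predictions, references):
    # Segment Thai text into words with newmm, then join with spaces so the
    # standard space-delimited WER implementation can compare word sequences.
    seg = lambda s: " ".join(word_tokenize(s, engine="newmm"))
    return wer_metric.compute(
        predictions=[seg(p) for p in predictions],
        references=[seg(r) for r in references],
    )

# Illustrative example (hypothetical transcript pair).
print(thai_wer(["สวัสดีครับ"], ["สวัสดีค่ะ"]))
```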

## Limitations and Future Work
Additional information is needed

## Acknowledgements
We extend our appreciation to the research teams behind the open speech models this work builds on, including AIResearch, BiodatLab, Looloo Technology, SCB 10X, and OpenAI. We thank Dr. Titipat Achakulwisut of BiodatLab for the evaluation pipeline, and ThaiSC (the NSTDA Supercomputer Center) for providing the LANTA supercomputer used for model training, fine-tuning, and evaluation.

## Pathumma Audio Team
*Pattara Tipaksorn*, Wayupuk Sommuang, Oatsada Chatthong, *Kwanchiva Thangthai*

## Citation
```bibtex
@misc{tipaksorn2024PathummaWhisper,
    title        = { {Pathumma Whisper Large V3 (TH)} },
    author       = { Pattara Tipaksorn and Wayupuk Sommuang and Oatsada Chatthong and Kwanchiva Thangthai },
    url          = { https://huggingface.co/nectec/Pathumma-whisper-th-large-v3 },
    publisher    = { Hugging Face },
    year         = { 2024 },
}
```