AshenR committed
Commit 21d5aaf · verified · 1 Parent(s): 423f974

Update README.md

Files changed (1):
  1. README.md +33 -15
README.md CHANGED
@@ -3,35 +3,53 @@ license: unknown
 language:
 - si
 metrics:
-- perplexity
+- name:
+  value: 64.59
+
 library_name: transformers
+tags:
+- AshenBerto
+- Sinhala
+- Roberta
 ---
 
-### Overview
-
-This is a slightly smaller model trained on half of the [Fasttext](https://fasttext.cc/docs/en/crawl-vectors.html) dataset. Since Sinhala is classified as a low-resource language, there is a significant scarcity of pre-trained models available for it. This lack of resources creates a noticeable gap in the language's representation within the field of natural language processing (NLP). As a result, developing new models tailored for Sinhala presents a valuable opportunity. This model can act as a foundational tool to enable further advancements in downstream tasks such as sentiment analysis, machine translation, named entity recognition, or question answering.
-## Model Specification
-
-The model chosen for training is [Roberta](https://arxiv.org/abs/1907.11692) with the following specifications:
-1. vocab_size=52000
-2. max_position_embeddings=514
-3. num_attention_heads=12
-4. num_hidden_layers=6
-5. type_vocab_size=1
-Perplexity Value - 3.5
-
-## How to Use
-You can use this model directly with a pipeline for masked language modeling:
+### 🌟 Overview
+
+This is a slightly smaller model trained on half of the [FastText](https://fasttext.cc/docs/en/crawl-vectors.html) dataset. Since Sinhala is a low-resource language, there’s a noticeable lack of pre-trained models available for it. 😕 This gap makes it harder to represent the language properly in the world of NLP.
+
+But hey, that’s where this model comes in! 🚀 It opens up exciting opportunities to improve tasks like sentiment analysis, machine translation, named entity recognition, or even question answering, tailored just for Sinhala. 🇱🇰✨
+
+---
+
+### 🛠 Model Specs
+
+Here’s what powers this model (we went with [RoBERTa](https://arxiv.org/abs/1907.11692)):
+
+1️⃣ **vocab_size** = 52,000
+2️⃣ **max_position_embeddings** = 514
+3️⃣ **num_attention_heads** = 12
+4️⃣ **num_hidden_layers** = 6
+5️⃣ **type_vocab_size** = 1
+🎯 **Perplexity Value**: 3.5
+
+---
+
+### 🚀 How to Use
+
+You can jump right in and use this model for masked language modeling! 🧩
 
-```py
+```python
 from transformers import AutoTokenizer, AutoModelWithLMHead, pipeline
 
+# Load the model and tokenizer
 model = AutoModelWithLMHead.from_pretrained("ashen/AshenBERTo")
 tokenizer = AutoTokenizer.from_pretrained("ashen/AshenBERTo")
 
+# Create a fill-mask pipeline
 fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
 
+# Try it out with a Sinhala sentence! 🇱🇰
 fill_mask("මම ගෙදර <mask>.")
-
 ```
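A note on the snippet above: `AutoModelWithLMHead` is deprecated in recent `transformers` releases; for a masked-language-model checkpoint like this one, `AutoModelForMaskedLM` is the drop-in replacement. A minimal sketch of the same usage with the current class (model ID taken from the README):

```python
# Same fill-mask usage as the README, using the non-deprecated Auto class.
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model = AutoModelForMaskedLM.from_pretrained("ashen/AshenBERTo")
tokenizer = AutoTokenizer.from_pretrained("ashen/AshenBERTo")
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# The pipeline returns a list of candidate fills, each a dict with
# 'sequence', 'token_str', and 'score'.
for pred in fill_mask("මම ගෙදර <mask>."):
    print(pred["token_str"], round(pred["score"], 3))
```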
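The five hyperparameters in the Model Specs section correspond one-to-one to arguments of `RobertaConfig` in `transformers`. The sketch below builds a randomly initialised model of that shape; it illustrates the stated architecture and is not the actual AshenBERTo training script (which this commit does not include):

```python
# Build a RoBERTa of the shape listed in the README's Model Specs.
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=52_000,            # vocab_size = 52,000
    max_position_embeddings=514,  # max_position_embeddings = 514
    num_attention_heads=12,       # num_attention_heads = 12
    num_hidden_layers=6,          # half the depth of RoBERTa-base (12 layers)
    type_vocab_size=1,            # type_vocab_size = 1
)

model = RobertaForMaskedLM(config)  # randomly initialised, untrained weights
print(f"{model.num_parameters():,} parameters")
```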
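The commit does not say how the reported perplexity of 3.5 (or the 64.59 metric value in the front matter) was measured. For masked language models, one common recipe is pseudo-perplexity: mask each token in turn, score it with the model, and exponentiate the average negative log-likelihood. A sketch under that assumption, using an arbitrary Sinhala sentence:

```python
# Pseudo-perplexity sketch for a masked LM: mask one token at a time.
import math
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ashen/AshenBERTo")
model = AutoModelForMaskedLM.from_pretrained("ashen/AshenBERTo").eval()

enc = tokenizer("මම ගෙදර ගියා.", return_tensors="pt")  # "I went home."
input_ids = enc["input_ids"][0]

log_probs = []
for i in range(1, len(input_ids) - 1):  # skip <s> and </s> special tokens
    masked = input_ids.clone()
    masked[i] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = model(masked.unsqueeze(0)).logits[0, i]
    log_probs.append(torch.log_softmax(logits, dim=-1)[input_ids[i]].item())

print("pseudo-perplexity:", math.exp(-sum(log_probs) / len(log_probs)))
```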