dtamayo committed (verified)
Commit a8ef4f0
1 Parent(s): 0cc8670

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,398 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ language:
3
+ - multilingual
4
+ - bg
5
+ - ca
6
+ - code
7
+ - cs
8
+ - cy
9
+ - da
10
+ - de
11
+ - el
12
+ - en
13
+ - es
14
+ - et
15
+ - eu
16
+ - fi
17
+ - fr
18
+ - ga
19
+ - gl
20
+ - hr
21
+ - hu
22
+ - it
23
+ - lt
24
+ - lv
25
+ - mt
26
+ - nl
27
+ - nn
28
+ - "no"
29
+ - oc
30
+ - pl
31
+ - pt
32
+ - ro
33
+ - ru
34
+ - sh
35
+ - sk
36
+ - sl
37
+ - sr
38
+ - sv
39
+ - uk
40
+ tags:
41
+ - roberta
42
+ - mroberta
43
+ - fill-mask
44
+ license: apache-2.0
45
+ library_name: transformers
46
+ ---
47
+
48
+
49
+ # mRoBERTa Model Card
50
+
51
+ mRoBERTa is a new **multilingual foundational model** based on the [RoBERTa](https://huggingface.co/FacebookAI/roberta-base) architecture, pretrained from scratch on 35 European languages and code. The pretraining corpus comprises **12.8TB of high-quality data**, significantly more than the roughly 2.5TB of multilingual data used to train previous state-of-the-art encoder-only foundational models such as [XLM-RoBERTa-base](https://huggingface.co/FacebookAI/xlm-roberta-base) and [XLM-RoBERTa-large](https://huggingface.co/FacebookAI/xlm-roberta-large).
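Since the card tags the model for `fill-mask`, a minimal usage sketch with the πŸ€— Transformers pipeline may be helpful. The repo id `BSC-LT/mRoBERTa` is taken from the links in this card, and the mask token is read from the tokenizer rather than hard-coded:

```python
from transformers import AutoTokenizer, pipeline

model_id = "BSC-LT/mRoBERTa"  # repo id as linked in this model card
tokenizer = AutoTokenizer.from_pretrained(model_id)
unmasker = pipeline("fill-mask", model=model_id, tokenizer=tokenizer)

# Use the tokenizer's own mask token instead of assuming "<mask>".
preds = unmasker(f"The capital of France is {tokenizer.mask_token}.")
for p in preds[:3]:
    print(p["token_str"], round(p["score"], 3))
```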
52
+
53
+ ## Technical Description
54
+
55
+ The table below summarizes the technical details of the mRoBERTa model.
56
+
57
+ | Description | Value |
58
+ |-------------------------|:--------------|
59
+ | Model Parameters | 283M |
60
+ | Tokenizer Type | SPM |
61
+ | Vocabulary size | 256,000 |
62
+ | Precision | bfloat16 |
63
+ | Context length | 512 |
64
+
65
+
66
+ Training Hyperparameters
67
+
68
+ | Hyperparameter | Value |
69
+ |------------------------- |:-------------- |
70
+ | Pretraining Objective | Masked Language Modeling |
71
+ | Learning Rate | 7E-05 |
72
+ | Learning Rate Scheduler | Cosine |
73
+ | Warmup | 10k |
74
+ | Optimizer | AdamW |
75
+ | Optimizer Hyperparameters | Ξ²1=0.9, Ξ²2=0.98, Ξ΅=1e-06 |
76
+ | Optimizer Decay | 1E-02 |
77
+ | Global Batch Size | 8192 |
78
+ | Dropout | 1E-01 |
79
+ | Attention Dropout | 1E-01 |
80
+ | Activation Function | GeLU |
81
+
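As a rough illustration, the optimizer and schedule rows above correspond to a setup like the following in PyTorch. This is a sketch, not the actual training code: the model is a stand-in module and `total_steps` is illustrative, since the table only states the 10k warmup.

```python
import math

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(768, 768)  # stand-in for the actual encoder
optimizer = AdamW(model.parameters(), lr=7e-5, betas=(0.9, 0.98),
                  eps=1e-6, weight_decay=1e-2)

warmup_steps = 10_000
total_steps = 100_000  # illustrative; not stated in the table

def lr_lambda(step: int) -> float:
    # Linear warmup for the first 10k steps, cosine decay afterwards.
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)
```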
82
+
83
+
84
+
85
+
86
+
87
+
88
+ ## Data
89
+
90
+ ### Pretraining Corpus
91
+
92
+ The training corpus consists of **35 European languages** and 92 programming languages, amounting to a total of **12.8TB** of high-quality data.
93
+
94
+ This highly multilingual corpus is predominantly composed of data from Colossal OSCAR, which contributes a significant 66.06% of the total tokens.
95
+ Following this, Starcoder provides 11.91%, and Spanish Crawling adds 3.34%.
96
+ The next largest sources are French PD at 3.12% and Proof Pile at 1.98%.
97
+ Other notable contributions include Macocu, Pile of Law, and Eurlex, each contributing around 1.5% to 1.3%.
98
+ These major sources collectively form the bulk of the corpus, ensuring a rich and diverse dataset for training the language model.
99
+ The remaining 10% comes from smaller sources in various languages.
100
+
101
+
102
+
103
+ The final pretraining language distribution split by language can be seen in the following picture:
104
+ <img src="./images/roberta_pretraining_lang_distribution.png" alt="Pretraining corpus distribution by language"/>
105
+
106
+ <small>Further details about the pretraining corpus can be found [here](https://huggingface.co/BSC-LT/ALIA-40b), as it is the same corpus used for the Salamandra foundational models.</small>
107
+
108
+
109
+ # Multilingual Evaluation and Performance
110
+
111
+ Evaluation uses multilingual benchmarks to assess the multilingual capabilities of the models.
112
+
113
+ The following multilingual benchmarks have been considered:
114
+
115
+ | Benchmark | Description | Languages | Source |
116
+ |------------------|-------------|-----------|--------------|
117
+ | XTREME| Benchmark for the evaluation of the cross-lingual generalization ability of pre-trained multilingual models | bg,ca,de,el,en,es,et,eu,fi,fr,hu,it,lt,nl,pl,pt,ro,ru,uk | [LINK](https://github.com/google-research/xtreme) |
118
+ | CLUB | Human-Annotated Catalan Benchmark | ca | [LINK](https://club.aina.bsc.es/datasets.html) |
119
+ | Basque Custom Benchmark | A set of NER, POS and TC evaluation tasks to assess the performance in the Basque language. | eu | [LINK](https://huggingface.co/datasets/orai-nlp/basqueGLUE) |
120
+ | Galician Custom Benchmark | NER and POS evaluation tasks to assess the performance in the Galician language. | gl | [LINK](https://github.com/xavier-gz/SLI_Galician_Corpora/blob/main/SLI_NERC_Galician_Gold_CoNLL.1.0.tar.gz) [LINK](https://huggingface.co/datasets/universal-dependencies/universal_dependencies)|
121
+
122
+
123
+ The following base foundational models have been considered:
124
+
125
+
126
+
127
+
128
+ | Multilingual Foundational Model | Number of Parameters | Vocab Size | Description |
129
+ |---------------------------------|----------------------|------------|-------------|
130
+ | [BERTa](https://huggingface.co/PlanTL-GOB-ES/roberta-base-ca) | 126M | 52K | BERTa is a Catalan-specific language model pretrained with Catalan-only data. |
131
+ | [BERTinho](https://huggingface.co/dvilares/bertinho-gl-base-cased) | 109M | 30K | BERTinho is a monolingual BERT model for the Galician language. |
132
+ | [mBERT](https://huggingface.co/google-bert/bert-base-multilingual-cased) | 178M | 120K | Multilingual BERT model pretrained on the top 104 languages with the largest Wikipedia. |
133
+ | [mRoBERTa](https://huggingface.co/BSC-LT/mRoBERTa) | 283M | 256K | RoBERTa base model pretrained with 35 European languages and a larger vocabulary size. |
134
+ | [roberta-base-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne) | 125M | 50K | RoBERTa base model pretrained with 570GB of data from web crawlings performed by the National Library of Spain from 2009 to 2019. |
135
+ | [RoBERTa-ca](https://huggingface.co/BSC-LT/RoBERTa-ca) | 125M | 50K | RoBERTa-ca is a Catalan-specific language model obtained by using vocabulary adaptation from mRoBERTa. |
136
+ | [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) | 279M | 250K | Foundational RoBERTa model pretrained with CommonCrawl data containing 100 languages. |
137
+ | [xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) | 561M | 250K | Foundational RoBERTa model pretrained with CommonCrawl data containing 100 languages. |
138
+
139
+
140
+
141
+ ## RESULTS
142
+
143
+ This section presents results across various multilingual benchmarks, with the maximum values highlighted in bold and the second-highest values underlined.
144
+
145
+ ### XTREME Benchmark
146
+
147
+ The Cross-lingual TRansfer Evaluation of Multilingual Encoders ([XTREME](https://github.com/google-research/xtreme)) benchmark evaluates the cross-lingual generalization ability of pre-trained multilingual models. It includes nine tasks that collectively require reasoning about different levels of syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks, availability of training data, and overlap with the languages present during pre-training of the models.
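Several of the structured-prediction results below report span-level F1 over BIO-tagged sequences. As a reference, here is a hand-rolled sketch of that metric; evaluation harnesses typically use a library such as seqeval, so treat this as illustrative (e.g. stray `I-` tags without a preceding `B-` are simply ignored here):

```python
def bio_spans(tags):
    """Extract (start, end, type) entity spans from a BIO tag sequence."""
    spans, start, etype = [], None, None
    for i, t in enumerate(tags + ["O"]):  # "O" sentinel flushes the last span
        boundary = t == "O" or t.startswith("B-") or (t.startswith("I-") and etype != t[2:])
        if boundary and start is not None:
            spans.append((start, i, etype))
            start, etype = None, None
        if t.startswith("B-"):
            start, etype = i, t[2:]
    return set(spans)

def span_f1(gold, pred):
    """Micro F1 over exact entity-span matches."""
    g, p = bio_spans(gold), bio_spans(pred)
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "O"]
print(span_f1(gold, pred))  # one of two gold spans recovered
```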
148
+
149
+ #### πŸ”΅ Sentence Classification
150
+
151
+ ##### πŸ”΅ XNLI
152
+ Metric used: Accuracy.
153
+
154
+ <table>
155
+ <tr><th>langs</th><th>mBERT (178M)</th><th>xlm-roberta-base (279M)</th><th>xlm-roberta-large (561M)</th><th>mRoBERTa (283M)</th></tr>
156
+ <tr><td>bg</td><td style=''>69.34</td><td style='text-decoration: underline;'>78.26</td><td style='font-weight: bold;'>82.10</td><td style=''>77.56</td></tr>
157
+ <tr><td>de</td><td style=''>71.54</td><td style=''>76.75</td><td style='font-weight: bold;'>81.62</td><td style='text-decoration: underline;'>77.01</td></tr>
158
+ <tr><td>el</td><td style=''>66.51</td><td style='text-decoration: underline;'>76.37</td><td style='font-weight: bold;'>81.46</td><td style=''>76.35</td></tr>
159
+ <tr><td>en</td><td style=''>82.20</td><td style=''>84.45</td><td style='font-weight: bold;'>87.98</td><td style='text-decoration: underline;'>85.69</td></tr>
160
+ <tr><td>es</td><td style=''>74.81</td><td style=''>78.18</td><td style='font-weight: bold;'>83.65</td><td style='text-decoration: underline;'>79.66</td></tr>
161
+ <tr><td>fr</td><td style=''>74.25</td><td style=''>78.24</td><td style='font-weight: bold;'>82.71</td><td style='text-decoration: underline;'>79.16</td></tr>
162
+ <tr><td>ru</td><td style=''>68.56</td><td style='text-decoration: underline;'>76.21</td><td style='font-weight: bold;'>79.10</td><td style=''>74.73</td></tr>
163
+ </table>
164
+
165
+ ##### πŸ”΅ PAWS-X
166
+ Metric used: Accuracy.
167
+
168
+ <table>
169
+ <tr><th>langs</th><th>mBERT (178M)</th><th>xlm-roberta-base (279M)</th><th>xlm-roberta-large (561M)</th><th>mRoBERTa (283M)</th></tr>
170
+ <tr><td>de</td><td style=''>85.65</td><td style='text-decoration: underline;'>86.95</td><td style=''>85.05</td><td style='font-weight: bold;'>87.35</td></tr>
171
+ <tr><td>en</td><td style=''>93.50</td><td style='text-decoration: underline;'>93.90</td><td style=''>91.45</td><td style='font-weight: bold;'>94.75</td></tr>
172
+ <tr><td>es</td><td style=''>87.75</td><td style='font-weight: bold;'>89.30</td><td style=''>87.65</td><td style='text-decoration: underline;'>88.60</td></tr>
173
+ <tr><td>fr</td><td style=''>86.60</td><td style='text-decoration: underline;'>88.55</td><td style=''>87.30</td><td style='font-weight: bold;'>89.20</td></tr>
174
+ </table>
175
+
176
+
177
+
178
+ #### 🟣 Structured Prediction
179
+
180
+ ##### 🟣 POS (UDPOS)
181
+ Metric used: F1.
182
+
183
+ <table>
184
+ <tr><th>langs</th><th>mBERT (178M)</th><th>xlm-roberta-base (279M)</th><th>xlm-roberta-large (561M)</th><th>mRoBERTa (283M)</th></tr>
185
+ <tr><td>bg</td><td style=''>85.14</td><td style='text-decoration: underline;'>88.62</td><td style='font-weight: bold;'>89.06</td><td style=''>88.19</td></tr>
186
+ <tr><td>de</td><td style=''>85.71</td><td style=''>88.41</td><td style='font-weight: bold;'>88.65</td><td style='text-decoration: underline;'>88.58</td></tr>
187
+ <tr><td>el</td><td style=''>80.92</td><td style='font-weight: bold;'>87.12</td><td style=''>86.55</td><td style='text-decoration: underline;'>87.03</td></tr>
188
+ <tr><td>en</td><td style=''>95.43</td><td style=''>95.79</td><td style='font-weight: bold;'>96.07</td><td style='text-decoration: underline;'>95.85</td></tr>
189
+ <tr><td>es</td><td style=''>85.85</td><td style='text-decoration: underline;'>88.10</td><td style='font-weight: bold;'>89.31</td><td style=''>87.45</td></tr>
190
+ <tr><td>et</td><td style=''>79.68</td><td style=''>86.22</td><td style='font-weight: bold;'>87.36</td><td style='text-decoration: underline;'>86.25</td></tr>
191
+ <tr><td>eu</td><td style=''>60.18</td><td style=''>68.83</td><td style='font-weight: bold;'>71.85</td><td style='text-decoration: underline;'>69.22</td></tr>
192
+ <tr><td>fi</td><td style=''>79.72</td><td style='text-decoration: underline;'>85.90</td><td style='font-weight: bold;'>86.54</td><td style=''>84.23</td></tr>
193
+ <tr><td>fr</td><td style=''>81.20</td><td style=''>86.34</td><td style='font-weight: bold;'>88.24</td><td style='text-decoration: underline;'>87.00</td></tr>
194
+ <tr><td>hu</td><td style=''>78.39</td><td style='text-decoration: underline;'>83.05</td><td style='font-weight: bold;'>83.84</td><td style=''>82.96</td></tr>
195
+ <tr><td>it</td><td style=''>87.86</td><td style=''>88.91</td><td style='font-weight: bold;'>90.01</td><td style='text-decoration: underline;'>89.11</td></tr>
196
+ <tr><td>lt</td><td style=''>78.59</td><td style='text-decoration: underline;'>83.86</td><td style='font-weight: bold;'>84.91</td><td style=''>81.12</td></tr>
197
+ <tr><td>nl</td><td style=''>88.59</td><td style=''>89.16</td><td style='font-weight: bold;'>89.70</td><td style='text-decoration: underline;'>89.31</td></tr>
198
+ <tr><td>pl</td><td style=''>80.34</td><td style='text-decoration: underline;'>84.61</td><td style='font-weight: bold;'>85.77</td><td style=''>84.23</td></tr>
199
+ <tr><td>pt</td><td style=''>85.77</td><td style='text-decoration: underline;'>87.53</td><td style='font-weight: bold;'>88.56</td><td style=''>87.18</td></tr>
200
+ <tr><td>ro</td><td style=''>76.51</td><td style='text-decoration: underline;'>83.99</td><td style='font-weight: bold;'>86.47</td><td style=''>82.74</td></tr>
201
+ <tr><td>ru</td><td style=''>85.36</td><td style=''>88.75</td><td style='font-weight: bold;'>89.83</td><td style='text-decoration: underline;'>89.09</td></tr>
202
+ <tr><td>uk</td><td style=''>80.63</td><td style=''>84.79</td><td style='font-weight: bold;'>85.84</td><td style='text-decoration: underline;'>85.19</td></tr>
203
+ </table>
204
+
205
+
206
+ ##### 🟣 NER (PANX)
207
+ Metric used: F1.
208
+
209
+ <table>
210
+ <tr><th>langs</th><th>mBERT (178M)</th><th>xlm-roberta-base (279M)</th><th>xlm-roberta-large (561M)</th><th>mRoBERTa (283M)</th></tr>
211
+ <tr><td>bg</td><td style=''>78.38</td><td style=''>76.52</td><td style='font-weight: bold;'>81.97</td><td style='text-decoration: underline;'>78.66</td></tr>
212
+ <tr><td>de</td><td style='font-weight: bold;'>78.89</td><td style=''>73.92</td><td style='text-decoration: underline;'>78.59</td><td style=''>78.17</td></tr>
213
+ <tr><td>el</td><td style=''>74.09</td><td style=''>73.07</td><td style='font-weight: bold;'>75.49</td><td style='text-decoration: underline;'>74.81</td></tr>
214
+ <tr><td>en</td><td style='font-weight: bold;'>84.69</td><td style=''>82.70</td><td style='text-decoration: underline;'>84.50</td><td style=''>83.56</td></tr>
215
+ <tr><td>es</td><td style=''>72.32</td><td style=''>72.83</td><td style='text-decoration: underline;'>73.46</td><td style='font-weight: bold;'>78.30</td></tr>
216
+ <tr><td>et</td><td style='text-decoration: underline;'>77.55</td><td style=''>72.56</td><td style='font-weight: bold;'>78.37</td><td style=''>73.92</td></tr>
217
+ <tr><td>eu</td><td style='font-weight: bold;'>66.52</td><td style=''>58.34</td><td style='text-decoration: underline;'>60.01</td><td style=''>56.74</td></tr>
218
+ <tr><td>fi</td><td style='text-decoration: underline;'>78.11</td><td style=''>74.98</td><td style='font-weight: bold;'>78.46</td><td style=''>76.42</td></tr>
219
+ <tr><td>fr</td><td style='text-decoration: underline;'>79.45</td><td style=''>77.00</td><td style='font-weight: bold;'>80.16</td><td style=''>76.94</td></tr>
220
+ <tr><td>hu</td><td style='text-decoration: underline;'>77.39</td><td style=''>75.48</td><td style='font-weight: bold;'>80.10</td><td style=''>73.31</td></tr>
221
+ <tr><td>it</td><td style='font-weight: bold;'>81.33</td><td style=''>76.68</td><td style='text-decoration: underline;'>80.60</td><td style=''>80.04</td></tr>
222
+ <tr><td>lt</td><td style='text-decoration: underline;'>75.48</td><td style=''>73.76</td><td style='font-weight: bold;'>76.41</td><td style=''>72.71</td></tr>
223
+ <tr><td>nl</td><td style='text-decoration: underline;'>82.40</td><td style=''>79.80</td><td style='font-weight: bold;'>82.92</td><td style=''>81.42</td></tr>
224
+ <tr><td>pl</td><td style='font-weight: bold;'>80.57</td><td style=''>77.15</td><td style='text-decoration: underline;'>80.55</td><td style=''>80.26</td></tr>
225
+ <tr><td>pt</td><td style='text-decoration: underline;'>79.66</td><td style=''>76.60</td><td style='font-weight: bold;'>80.97</td><td style=''>76.13</td></tr>
226
+ <tr><td>ro</td><td style='text-decoration: underline;'>74.73</td><td style=''>71.79</td><td style='font-weight: bold;'>81.42</td><td style=''>66.85</td></tr>
227
+ <tr><td>ru</td><td style=''>65.42</td><td style=''>63.93</td><td style='font-weight: bold;'>70.68</td><td style='text-decoration: underline;'>67.53</td></tr>
228
+ <tr><td>uk</td><td style='text-decoration: underline;'>71.71</td><td style=''>66.78</td><td style='font-weight: bold;'>74.12</td><td style=''>71.69</td></tr>
229
+ </table>
230
+
231
+ #### βšͺ️ Sentence Retrieval
232
+
233
+ ##### βšͺ️ BUCC2018
234
+ Metric used: F1.
235
+
236
+ <table>
237
+ <tr><th>langs</th><th>mBERT (178M)</th><th>xlm-roberta-base (279M)</th><th>xlm-roberta-large (561M)</th><th>mRoBERTa (283M)</th></tr>
238
+ <tr><td>de</td><td style=''>63.26</td><td style=''>66.83</td><td style='text-decoration: underline;'>75.23</td><td style='font-weight: bold;'>86.09</td></tr>
239
+ <tr><td>fr</td><td style=''>62.62</td><td style=''>65.79</td><td style='text-decoration: underline;'>69.29</td><td style='font-weight: bold;'>79.21</td></tr>
240
+ <tr><td>ru</td><td style=''>54.97</td><td style=''>70.12</td><td style='text-decoration: underline;'>75.57</td><td style='font-weight: bold;'>82.93</td></tr>
241
+ </table>
242
+
243
+ ##### βšͺ️ Tatoeba
244
+ Metric used: Accuracy.
245
+
246
+ <table>
247
+ <tr><th>langs</th><th>mBERT (178M)</th><th>xlm-roberta-base (279M)</th><th>xlm-roberta-large (561M)</th><th>mRoBERTa (283M)</th></tr>
248
+ <tr><td>bg</td><td style=''>48.80</td><td style=''>66.90</td><td style='text-decoration: underline;'>71.60</td><td style='font-weight: bold;'>77.60</td></tr>
249
+ <tr><td>ca</td><td style=''>59.80</td><td style=''>57.30</td><td style='text-decoration: underline;'>62.20</td><td style='font-weight: bold;'>80.20</td></tr>
250
+ <tr><td>de</td><td style=''>75.40</td><td style=''>88.40</td><td style='text-decoration: underline;'>88.80</td><td style='font-weight: bold;'>95.60</td></tr>
251
+ <tr><td>el</td><td style=''>29.80</td><td style=''>51.60</td><td style='text-decoration: underline;'>61.80</td><td style='font-weight: bold;'>72.30</td></tr>
252
+ <tr><td>es</td><td style=''>64.10</td><td style=''>71.00</td><td style='text-decoration: underline;'>75.70</td><td style='font-weight: bold;'>89.70</td></tr>
253
+ <tr><td>et</td><td style=''>28.10</td><td style=''>44.20</td><td style='text-decoration: underline;'>52.20</td><td style='font-weight: bold;'>61.80</td></tr>
254
+ <tr><td>eu</td><td style=''>25.50</td><td style=''>26.10</td><td style='text-decoration: underline;'>35.80</td><td style='font-weight: bold;'>53.40</td></tr>
255
+ <tr><td>fi</td><td style=''>39.00</td><td style='text-decoration: underline;'>63.90</td><td style='font-weight: bold;'>71.60</td><td style='text-decoration: underline;'>63.90</td></tr>
256
+ <tr><td>fr</td><td style=''>64.30</td><td style=''>72.50</td><td style='text-decoration: underline;'>73.70</td><td style='font-weight: bold;'>81.30</td></tr>
257
+ <tr><td>hu</td><td style=''>36.90</td><td style=''>58.70</td><td style='font-weight: bold;'>65.40</td><td style='text-decoration: underline;'>62.40</td></tr>
258
+ <tr><td>it</td><td style=''>57.30</td><td style=''>64.70</td><td style='text-decoration: underline;'>68.30</td><td style='font-weight: bold;'>80.30</td></tr>
259
+ <tr><td>lt</td><td style=''>31.10</td><td style='text-decoration: underline;'>54.80</td><td style='font-weight: bold;'>59.60</td><td style=''>49.30</td></tr>
260
+ <tr><td>nl</td><td style=''>63.70</td><td style=''>76.80</td><td style='text-decoration: underline;'>80.80</td><td style='font-weight: bold;'>86.60</td></tr>
261
+ <tr><td>pl</td><td style=''>50.10</td><td style=''>65.20</td><td style='text-decoration: underline;'>75.90</td><td style='font-weight: bold;'>79.00</td></tr>
262
+ <tr><td>pt</td><td style=''>68.40</td><td style=''>76.60</td><td style='text-decoration: underline;'>82.20</td><td style='font-weight: bold;'>88.80</td></tr>
263
+ <tr><td>ro</td><td style=''>51.50</td><td style=''>68.80</td><td style='font-weight: bold;'>75.70</td><td style='text-decoration: underline;'>69.00</td></tr>
264
+ <tr><td>ru</td><td style=''>59.40</td><td style=''>69.80</td><td style='text-decoration: underline;'>74.10</td><td style='font-weight: bold;'>81.60</td></tr>
265
+ <tr><td>uk</td><td style=''>52.60</td><td style=''>57.30</td><td style='text-decoration: underline;'>69.10</td><td style='font-weight: bold;'>77.50</td></tr>
266
+ </table>
267
+
268
+
269
+
270
+
271
+
272
+
273
+ #### ⚫ Question Answering
274
+
275
+ ##### ⚫ XQUAD
276
+ Metric used: F1.
277
+
278
+ <table>
279
+ <tr><th>langs</th><th>mBERT (178M)</th><th>xlm-roberta-base (279M)</th><th>xlm-roberta-large (561M)</th><th>mRoBERTa (283M)</th></tr>
280
+ <tr><td>de</td><td style=''>73.55</td><td style='text-decoration: underline;'>74.81</td><td style='font-weight: bold;'>80.15</td><td style=''>73.92</td></tr>
281
+ <tr><td>el</td><td style=''>63.74</td><td style=''>73.34</td><td style='font-weight: bold;'>80.86</td><td style='text-decoration: underline;'>73.56</td></tr>
282
+ <tr><td>en</td><td style='text-decoration: underline;'>84.84</td><td style=''>84.22</td><td style='font-weight: bold;'>88.13</td><td style=''>82.70</td></tr>
283
+ <tr><td>es</td><td style=''>75.06</td><td style=''>76.44</td><td style='font-weight: bold;'>82.21</td><td style='text-decoration: underline;'>77.07</td></tr>
284
+ <tr><td>ru</td><td style=''>72.02</td><td style='text-decoration: underline;'>74.73</td><td style='font-weight: bold;'>80.11</td><td style=''>72.85</td></tr>
285
+ </table>
286
+
287
+
288
+ ##### ⚫ MLQA
289
+ Metric used: F1.
290
+
291
+ <table>
292
+ <tr><th>langs</th><th>mBERT (178M)</th><th>xlm-roberta-base (279M)</th><th>xlm-roberta-large (561M)</th><th>mRoBERTa (283M)</th></tr>
293
+ <tr><td>de</td><td style=''>57.68</td><td style=''>62.20</td><td style='font-weight: bold;'>68.78</td><td style='text-decoration: underline;'>63.25</td></tr>
294
+ <tr><td>en</td><td style=''>80.16</td><td style='text-decoration: underline;'>80.27</td><td style='font-weight: bold;'>83.52</td><td style=''>79.81</td></tr>
295
+ <tr><td>es</td><td style=''>64.90</td><td style=''>66.97</td><td style='font-weight: bold;'>72.93</td><td style='text-decoration: underline;'>68.14</td></tr>
296
+ </table>
297
+
298
+ ##### ⚫ TyDiQA
299
+ Metric used: F1.
300
+
301
+ <table>
302
+ <tr><th>langs</th><th>mBERT (178M)</th><th>xlm-roberta-base (279M)</th><th>xlm-roberta-large (561M)</th><th>mRoBERTa (283M)</th></tr>
303
+ <tr><td>en</td><td style='text-decoration: underline;'>68.26</td><td style=''>59.57</td><td style='font-weight: bold;'>71.33</td><td style=''>61.50</td></tr>
304
+ <tr><td>fi</td><td style='text-decoration: underline;'>55.70</td><td style=''>51.91</td><td style='font-weight: bold;'>70.62</td><td style=''>52.32</td></tr>
305
+ <tr><td>ru</td><td style='text-decoration: underline;'>53.71</td><td style=''>50.75</td><td style='font-weight: bold;'>64.48</td><td style=''>50.66</td></tr>
306
+ </table>
307
+
308
+
309
+
310
+ ### CLUB Benchmark
311
+
312
+ The [Catalan Language Understanding Benchmark](https://club.aina.bsc.es/datasets.html) consists of 6 tasks: Named Entity Recognition (NER), Part-of-Speech Tagging (POS), Semantic Textual Similarity (STS), Text Classification (TC), Textual Entailment (TE), and Question Answering (QA). This benchmark evaluates the model's capabilities in the Catalan language.
313
+
314
+ This comparison also includes RoBERTa-ca, a model derived from mRoBERTa by applying vocabulary adaptation and performing continual pre-training on a 95GB Catalan-only corpus. For further details, visit [here](https://huggingface.co/BSC-LT/RoBERTa-ca).
315
+
316
+ <table>
317
+ <tr><th>tasks</th><th style=''>roberta-base-bne (125M)</th><th style=''>berta (126M)</th><th style=''>mBERT (178M)</th><th style=''>xlm-roberta-base (279M)</th><th style=''>xlm-roberta-large (561M)</th><th style=''>roberta-ca (125M)</th><th style=''>mRoBERTa (283M)</th></tr>
318
+ <tr><td>ner (F1)</td><td style=''>87.59</td><td style='text-decoration: underline;'>89.47</td><td style=''>85.89</td><td style=''>87.50</td><td style='text-decoration: underline;'>89.47</td><td style='font-weight: bold;'>89.70</td><td style=''>88.33</td></tr>
319
+ <tr><td>pos (F1)</td><td style=''>98.64</td><td style=''>98.89</td><td style=''>98.78</td><td style=''>98.91</td><td style='font-weight: bold;'>99.03</td><td style='text-decoration: underline;'>99.00</td><td style=''>98.98</td></tr>
320
+ <tr><td>sts (Pearson)</td><td style=''>74.27</td><td style=''>81.39</td><td style=''>77.05</td><td style=''>75.11</td><td style='font-weight: bold;'>83.49</td><td style='text-decoration: underline;'>82.99</td><td style=''>79.52</td></tr>
321
+ <tr><td>tc (Acc.)</td><td style='text-decoration: underline;'>73.86</td><td style=''>73.16</td><td style=''>72.00</td><td style=''>73.05</td><td style='font-weight: bold;'>74.10</td><td style=''>72.81</td><td style=''>72.41</td></tr>
322
+ <tr><td>te (Acc.)</td><td style=''>72.27</td><td style=''>80.11</td><td style=''>75.86</td><td style=''>78.27</td><td style='font-weight: bold;'>86.63</td><td style=''>82.14</td><td style='text-decoration: underline;'>82.38</td></tr>
323
+ <tr><td>viquiquad (F1)</td><td style=''>82.56</td><td style=''>86.74</td><td style=''>87.42</td><td style=''>86.81</td><td style='font-weight: bold;'>90.35</td><td style=''>87.31</td><td style='text-decoration: underline;'>87.86</td></tr>
324
+ <tr><td>xquad (F1)</td><td style=''>60.56</td><td style=''>67.38</td><td style=''>67.72</td><td style=''>68.56</td><td style='font-weight: bold;'>76.08</td><td style='text-decoration: underline;'>70.53</td><td style=''>69.40</td></tr>
325
+ </table>
326
+
327
+ ### Galician Benchmark
328
+
329
+ To evaluate performance in Galician, the models are tested on two tasks highlighted in [Bertinho's paper](https://arxiv.org/pdf/2103.13799):
330
+ - _NER task_: NER task using the [SLI NERC](https://github.com/xavier-gz/SLI_Galician_Corpora/blob/main/SLI_NERC_Galician_Gold_CoNLL.1.0.tar.gz) dataset.
331
+ - _POS task_: POS task using the [Universal Dependencies](https://huggingface.co/datasets/universal-dependencies/universal_dependencies) dataset.
332
+
333
+
334
+ <table>
335
+ <tr><th>tasks</th><th style=''>bertinho (109M)</th><th style=''>roberta-base-bne (125M)</th><th style=''>mBERT (178M)</th><th style=''>xlm-roberta-base (279M)</th><th style=''>xlm-roberta-large (561M)</th><th style=''>mRoBERTa (283M)</th></tr>
336
+ <tr><td>ner-dataset-SLI NERC (F1)</td><td style=''>86.27</td><td style=''>86.80</td><td style=''>86.22</td><td style=''>85.99</td><td style='font-weight: bold;'>88.10</td><td style='text-decoration: underline;'>87.75</td></tr>
337
+ <tr><td>pos-dataset-UD_GL_CTG (F1)</td><td style=''>97.58</td><td style=''>97.27</td><td style=''>97.57</td><td style='text-decoration: underline;'>97.77</td><td style='font-weight: bold;'>97.95</td><td style=''>97.75</td></tr>
338
+ </table>
339
+
340
+
341
+
342
+
343
+ ### BasqueGLUE Benchmark
344
+
345
+ To assess model performance in the Basque language, the [BasqueGLUE](https://huggingface.co/datasets/orai-nlp/basqueGLUE) benchmark is used. BasqueGLUE was built from previously existing datasets, following criteria similar to those used for the construction of GLUE and SuperGLUE. Some tasks have been slightly adapted so that all models can be assessed uniformly (e.g., FMTODeu_slot is originally described as a "Slot filling" task, but it is evaluated as NERC, since it follows the BIO annotation scheme).
346
+
347
+ <table>
348
+ <tr><th>tasks</th><th style=''>roberta-base-bne (125M)</th><th style=''>mBERT (178M)</th><th style=''>xlm-roberta-base (279M)</th><th style=''>xlm-roberta-large (561M)</th><th style=''>mRoBERTa (283M)</th></tr>
349
+ <tr><td>NERC - NERCid (F1)</td><td style=''>71.53</td><td style=''>79.98</td><td style='text-decoration: underline;'>81.74</td><td style='font-weight: bold;'>83.96</td><td style=''>80.86</td></tr>
350
+ <tr><td>NERC - NERCood (F1)</td><td style=''>61.47</td><td style=''>76.95</td><td style=''>76.23</td><td style='font-weight: bold;'>80.25</td><td style='text-decoration: underline;'>76.97</td></tr>
351
+ <tr><td>NERC - FMTODeu_slot (F1)</td><td style=''>72.70</td><td style=''>73.65</td><td style=''>73.80</td><td style='text-decoration: underline;'>77.09</td><td style='font-weight: bold;'>77.32</td></tr>
352
+ <tr><td>Sentiment Analysis - BEC2016eu (Acc.)</td><td style=''>67.13</td><td style=''>67.05</td><td style='font-weight: bold;'>69.89</td><td style=''>67.90</td><td style='text-decoration: underline;'>69.20</td></tr>
353
+ <tr><td>Topic Classification - BHTCv2 (Acc.)</td><td style=''>66.72</td><td style=''>70.17</td><td style=''>72.01</td><td style='font-weight: bold;'>75.78</td><td style='text-decoration: underline;'>72.55</td></tr>
354
+ <tr><td>Intent Classification - FMTODeu_intent (Acc.)</td><td style=''>78.38</td><td style=''>78.01</td><td style=''>82.15</td><td style='font-weight: bold;'>83.35</td><td style='text-decoration: underline;'>83.07</td></tr>
355
+ <tr><td>Stance Detection - VaxxStance (Acc.)</td><td style=''>58.01</td><td style='font-weight: bold;'>66.67</td><td style=''>61.22</td><td style='text-decoration: underline;'>66.03</td><td style=''>65.71</td></tr>
356
+ </table>
357
+
358
+
359
+
360
+ ## Additional information
361
+
362
+ ### Author
363
+ The Language Technologies Unit from Barcelona Supercomputing Center.
364
+
365
+ ### Contact
366
+ For further information, please send an email to <[email protected]>.
367
+
368
+ ### Copyright
369
+ Copyright(c) 2024 by Language Technologies Unit, Barcelona Supercomputing Center.
370
+
371
+
372
+ ### Funding
373
+
374
+ This work has been promoted and financed by the Ministerio para la TransformaciΓ³n Digital y de la FunciΓ³n PΓΊblica and Plan de RecuperaciΓ³n, TransformaciΓ³n y Resiliencia - Funded by EU – NextGenerationEU within the framework of the project Modelos del Lenguaje.
375
+
376
+ ### Acknowledgements
377
+
378
+ This project has benefited from data contributions by numerous teams and institutions.
379
+
380
+ In Catalonia, many institutions have been involved in the project. Our thanks to Γ’mnium Cultural, Parlament de Catalunya, Institut d'Estudis Aranesos, RacΓ³ CatalΓ , Vilaweb, ACN, NaciΓ³ Digital, El mΓ³n and AquΓ­ BerguedΓ .
381
+
382
+ At national level, we are especially grateful to our ILENIA project partners: CENID, HiTZ and CiTIUS for their participation. We also extend our genuine gratitude to the Spanish Senate and Congress, FundaciΓ³n Dialnet, FundaciΓ³n Elcano and the β€˜Instituto Universitario de Sistemas Inteligentes y Aplicaciones NumΓ©ricas en IngenierΓ­a (SIANI)’ of the University of Las Palmas de Gran Canaria.
383
+
384
+ At the international level, we thank the Welsh government, DFKI, Occiglot project, especially Malte Ostendorff, and The Common Crawl Foundation, especially Pedro Ortiz, for their collaboration.
385
+
386
+ Their valuable efforts have been instrumental in the development of this work.
387
+
388
+
389
+ ### Disclaimer
390
+ Be aware that the model may contain biases or other unintended distortions.
391
+ When third parties deploy systems or provide services based on this model, or use the model themselves,
392
+ they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations,
393
+ including those governing the use of Artificial Intelligence.
394
+
395
+ The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.
396
+
397
+ ### License
398
+ [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
config.json ADDED
@@ -0,0 +1,26 @@
+ {
+ "architectures": [
+ "RobertaForMaskedLM"
+ ],
+ "attention_probs_dropout_prob": 0.1,
+ "bos_token_id": 0,
+ "classifier_dropout": null,
+ "eos_token_id": 2,
+ "hidden_act": "gelu",
+ "hidden_dropout_prob": 0.1,
+ "hidden_size": 768,
+ "initializer_range": 0.02,
+ "intermediate_size": 3072,
+ "layer_norm_eps": 1e-05,
+ "max_position_embeddings": 514,
+ "model_type": "roberta",
+ "num_attention_heads": 12,
+ "num_hidden_layers": 12,
+ "pad_token_id": 1,
+ "position_embedding_type": "absolute",
+ "torch_dtype": "float32",
+ "transformers_version": "4.45.2",
+ "type_vocab_size": 1,
+ "use_cache": true,
+ "vocab_size": 256000
+ }
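
As a quick sanity check (a back-of-the-envelope sketch of our own, not part of the card), the config.json values above imply roughly 283M parameters, which is consistent with the ~1.13 GB float32 checkpoint stored in this repository. The breakdown assumes the standard RobertaForMaskedLM layout with the decoder weight tied to the word embeddings:

```python
# Back-of-the-envelope parameter count from the config.json values above.
vocab_size = 256_000
hidden = 768
intermediate = 3_072
layers = 12
max_pos = 514
type_vocab = 1

# Embedding block: word, position and token-type embeddings plus LayerNorm.
embeddings = (
    vocab_size * hidden
    + max_pos * hidden
    + type_vocab * hidden
    + 2 * hidden  # LayerNorm weight + bias
)

# One encoder layer: Q/K/V/output projections, the FFN, and two LayerNorms.
attention = 4 * (hidden * hidden + hidden)
ffn = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
layer_norms = 2 * (2 * hidden)
encoder = layers * (attention + ffn + layer_norms)

# MLM head: dense transform + LayerNorm + output bias (the decoder weight is
# tied to the word embeddings, so it adds no extra parameters).
lm_head = (hidden * hidden + hidden) + 2 * hidden + vocab_size

total = embeddings + encoder + lm_head
print(f"~{total / 1e6:.0f}M parameters, ~{total * 4 / 1e9:.2f} GB in float32")
# → ~283M parameters, ~1.13 GB in float32
```

Note that the 256k multilingual vocabulary dominates the count: the word-embedding matrix alone accounts for about 197M of the ~283M parameters.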
images/roberta_pretraining_lang_distribution.png ADDED
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:922af003e4d00f517225ba8e65437aa2021e9995850a04019fb58aba1d90f26a
+ size 1132679048
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e20d0734fd085e691495bb99dc5a11693eed22cbf1c841eef4a066efcaa8b473
+ size 1132707134
special_tokens_map.json ADDED
@@ -0,0 +1,23 @@
+ {
+ "bos_token": {
+ "content": "<s>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "eos_token": {
+ "content": "</s>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "unk_token": {
+ "content": "<unk>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ }
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9f2bf5621066c920a83c8b0f75b8d0d95b8b1bad04ac245c6746e6d2b303b76b
+ size 37007530
tokenizer.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5072e3209a04aa01dbf4db72b8fec52cf8cd06a042c9ba819678e084f7b665d5
+ size 4813283
tokenizer_config.json ADDED
@@ -0,0 +1,1098 @@
1
+ {
2
+ "add_bos_token": true,
3
+ "add_eos_token": false,
4
+ "add_prefix_space": true,
5
+ "added_tokens_decoder": {
6
+ "0": {
7
+ "content": "<s>",
8
+ "lstrip": false,
9
+ "normalized": false,
10
+ "rstrip": false,
11
+ "single_word": false,
12
+ "special": true
13
+ },
14
+ "1": {
15
+ "content": "<pad>",
16
+ "lstrip": false,
17
+ "normalized": false,
18
+ "rstrip": false,
19
+ "single_word": false,
20
+ "special": true
21
+ },
22
+ "2": {
23
+ "content": "</s>",
24
+ "lstrip": false,
25
+ "normalized": false,
26
+ "rstrip": false,
27
+ "single_word": false,
28
+ "special": true
29
+ },
30
+ "3": {
31
+ "content": "<unk>",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false,
36
+ "special": true
37
+ },
38
+ "4": {
39
+ "content": "<|im_start|>",
40
+ "lstrip": false,
41
+ "normalized": false,
42
+ "rstrip": false,
43
+ "single_word": false,
44
+ "special": true
45
+ },
46
+ "5": {
47
+ "content": "<|im_end|>",
48
+ "lstrip": false,
49
+ "normalized": false,
50
+ "rstrip": false,
51
+ "single_word": false,
52
+ "special": true
53
+ },
54
+ "6": {
55
+ "content": "<|reserved_token_1|>",
56
+ "lstrip": false,
57
+ "normalized": false,
58
+ "rstrip": false,
59
+ "single_word": false,
60
+ "special": true
61
+ },
62
+ "7": {
63
+ "content": "<|reserved_token_2|>",
64
+ "lstrip": false,
65
+ "normalized": false,
66
+ "rstrip": false,
67
+ "single_word": false,
68
+ "special": true
69
+ },
70
+ "8": {
71
+ "content": "<|reserved_token_3|>",
72
+ "lstrip": false,
73
+ "normalized": false,
74
+ "rstrip": false,
75
+ "single_word": false,
76
+ "special": true
77
+ },
78
+ "9": {
79
+ "content": "<|reserved_token_4|>",
80
+ "lstrip": false,
81
+ "normalized": false,
82
+ "rstrip": false,
83
+ "single_word": false,
84
+ "special": true
85
+ },
86
+ "10": {
87
+ "content": "<|reserved_token_5|>",
88
+ "lstrip": false,
89
+ "normalized": false,
90
+ "rstrip": false,
91
+ "single_word": false,
92
+ "special": true
93
+ },
94
+ "11": {
95
+ "content": "<|reserved_token_6|>",
96
+ "lstrip": false,
97
+ "normalized": false,
98
+ "rstrip": false,
99
+ "single_word": false,
100
+ "special": true
101
+ },
102
+ "12": {
103
+ "content": "<|reserved_token_7|>",
104
+ "lstrip": false,
105
+ "normalized": false,
106
+ "rstrip": false,
107
+ "single_word": false,
108
+ "special": true
109
+ },
110
+ "13": {
111
+ "content": "<|reserved_token_8|>",
112
+ "lstrip": false,
113
+ "normalized": false,
114
+ "rstrip": false,
115
+ "single_word": false,
116
+ "special": true
117
+ },
118
+ "14": {
119
+ "content": "<|reserved_token_9|>",
120
+ "lstrip": false,
121
+ "normalized": false,
122
+ "rstrip": false,
123
+ "single_word": false,
124
+ "special": true
125
+ },
126
+ "15": {
127
+ "content": "<|reserved_token_10|>",
128
+ "lstrip": false,
129
+ "normalized": false,
130
+ "rstrip": false,
131
+ "single_word": false,
132
+ "special": true
133
+ },
134
+ "16": {
135
+ "content": "<|reserved_token_11|>",
136
+ "lstrip": false,
137
+ "normalized": false,
138
+ "rstrip": false,
139
+ "single_word": false,
140
+ "special": true
141
+ },
142
+ "17": {
143
+ "content": "<|reserved_token_12|>",
144
+ "lstrip": false,
145
+ "normalized": false,
146
+ "rstrip": false,
147
+ "single_word": false,
148
+ "special": true
149
+ },
150
+ "18": {
151
+ "content": "<|reserved_token_13|>",
152
+ "lstrip": false,
153
+ "normalized": false,
154
+ "rstrip": false,
155
+ "single_word": false,
156
+ "special": true
157
+ },
158
+ "19": {
159
+ "content": "<|reserved_token_14|>",
160
+ "lstrip": false,
161
+ "normalized": false,
162
+ "rstrip": false,
163
+ "single_word": false,
164
+ "special": true
165
+ },
166
+ "20": {
167
+ "content": "<|reserved_token_15|>",
168
+ "lstrip": false,
169
+ "normalized": false,
170
+ "rstrip": false,
171
+ "single_word": false,
172
+ "special": true
173
+ },
174
+ "21": {
175
+ "content": "<|reserved_token_16|>",
176
+ "lstrip": false,
177
+ "normalized": false,
178
+ "rstrip": false,
179
+ "single_word": false,
180
+ "special": true
181
+ },
182
+ "22": {
183
+ "content": "<|reserved_token_17|>",
184
+ "lstrip": false,
185
+ "normalized": false,
186
+ "rstrip": false,
187
+ "single_word": false,
188
+ "special": true
189
+ },
190
+ "23": {
191
+ "content": "<|reserved_token_18|>",
192
+ "lstrip": false,
193
+ "normalized": false,
194
+ "rstrip": false,
195
+ "single_word": false,
196
+ "special": true
197
+ },
198
+ "24": {
199
+ "content": "<|reserved_token_19|>",
200
+ "lstrip": false,
201
+ "normalized": false,
202
+ "rstrip": false,
203
+ "single_word": false,
204
+ "special": true
205
+ },
206
+ "25": {
207
+ "content": "<|reserved_token_20|>",
208
+ "lstrip": false,
209
+ "normalized": false,
210
+ "rstrip": false,
211
+ "single_word": false,
212
+ "special": true
213
+ },
214
+ "26": {
215
+ "content": "<|reserved_token_21|>",
216
+ "lstrip": false,
217
+ "normalized": false,
218
+ "rstrip": false,
219
+ "single_word": false,
220
+ "special": true
221
+ },
222
+ "27": {
223
+ "content": "<|reserved_token_22|>",
224
+ "lstrip": false,
225
+ "normalized": false,
226
+ "rstrip": false,
227
+ "single_word": false,
228
+ "special": true
229
+ },
230
+ "28": {
231
+ "content": "<|reserved_token_23|>",
232
+ "lstrip": false,
233
+ "normalized": false,
234
+ "rstrip": false,
235
+ "single_word": false,
236
+ "special": true
237
+ },
238
+ "29": {
239
+ "content": "<|reserved_token_24|>",
240
+ "lstrip": false,
241
+ "normalized": false,
242
+ "rstrip": false,
243
+ "single_word": false,
244
+ "special": true
245
+ },
246
+ "30": {
247
+ "content": "<|reserved_token_25|>",
248
+ "lstrip": false,
249
+ "normalized": false,
250
+ "rstrip": false,
251
+ "single_word": false,
252
+ "special": true
253
+ },
254
+ "31": {
255
+ "content": "<|reserved_token_26|>",
256
+ "lstrip": false,
257
+ "normalized": false,
258
+ "rstrip": false,
259
+ "single_word": false,
260
+ "special": true
261
+ },
262
+ "32": {
263
+ "content": "<|reserved_token_27|>",
264
+ "lstrip": false,
265
+ "normalized": false,
266
+ "rstrip": false,
267
+ "single_word": false,
268
+ "special": true
269
+ },
270
+ "33": {
271
+ "content": "<|reserved_token_28|>",
272
+ "lstrip": false,
273
+ "normalized": false,
274
+ "rstrip": false,
275
+ "single_word": false,
276
+ "special": true
277
+ },
278
+ "34": {
279
+ "content": "<|reserved_token_29|>",
280
+ "lstrip": false,
281
+ "normalized": false,
282
+ "rstrip": false,
283
+ "single_word": false,
284
+ "special": true
285
+ },
286
+ "35": {
287
+ "content": "<|reserved_token_30|>",
288
+ "lstrip": false,
289
+ "normalized": false,
290
+ "rstrip": false,
291
+ "single_word": false,
292
+ "special": true
293
+ },
294
+ "36": {
295
+ "content": "<|reserved_token_31|>",
296
+ "lstrip": false,
297
+ "normalized": false,
298
+ "rstrip": false,
299
+ "single_word": false,
300
+ "special": true
301
+ },
302
+ "37": {
303
+ "content": "<|reserved_token_32|>",
304
+ "lstrip": false,
305
+ "normalized": false,
306
+ "rstrip": false,
307
+ "single_word": false,
308
+ "special": true
309
+ },
310
+ "38": {
311
+ "content": "<|reserved_token_33|>",
312
+ "lstrip": false,
313
+ "normalized": false,
314
+ "rstrip": false,
315
+ "single_word": false,
316
+ "special": true
317
+ },
318
+ "39": {
319
+ "content": "<|reserved_token_34|>",
320
+ "lstrip": false,
321
+ "normalized": false,
322
+ "rstrip": false,
323
+ "single_word": false,
324
+ "special": true
325
+ },
326
+ "40": {
327
+ "content": "<|reserved_token_35|>",
328
+ "lstrip": false,
329
+ "normalized": false,
330
+ "rstrip": false,
331
+ "single_word": false,
332
+ "special": true
333
+ },
334
+ "41": {
335
+ "content": "<|reserved_token_36|>",
336
+ "lstrip": false,
337
+ "normalized": false,
338
+ "rstrip": false,
339
+ "single_word": false,
340
+ "special": true
341
+ },
342
+ "42": {
343
+ "content": "<|reserved_token_37|>",
344
+ "lstrip": false,
345
+ "normalized": false,
346
+ "rstrip": false,
347
+ "single_word": false,
348
+ "special": true
349
+ },
350
+ "43": {
351
+ "content": "<|reserved_token_38|>",
352
+ "lstrip": false,
353
+ "normalized": false,
354
+ "rstrip": false,
355
+ "single_word": false,
356
+ "special": true
357
+ },
358
+ "44": {
359
+ "content": "<|reserved_token_39|>",
360
+ "lstrip": false,
361
+ "normalized": false,
362
+ "rstrip": false,
363
+ "single_word": false,
364
+ "special": true
365
+ },
366
+ "45": {
367
+ "content": "<|reserved_token_40|>",
368
+ "lstrip": false,
369
+ "normalized": false,
370
+ "rstrip": false,
371
+ "single_word": false,
372
+ "special": true
373
+ },
374
+ "46": {
375
+ "content": "<|reserved_token_41|>",
376
+ "lstrip": false,
377
+ "normalized": false,
378
+ "rstrip": false,
379
+ "single_word": false,
380
+ "special": true
381
+ },
382
+ "47": {
383
+ "content": "<|reserved_token_42|>",
384
+ "lstrip": false,
385
+ "normalized": false,
386
+ "rstrip": false,
387
+ "single_word": false,
388
+ "special": true
389
+ },
390
+ "48": {
391
+ "content": "<|reserved_token_43|>",
392
+ "lstrip": false,
393
+ "normalized": false,
394
+ "rstrip": false,
395
+ "single_word": false,
396
+ "special": true
397
+ },
398
+ "49": {
399
+ "content": "<|reserved_token_44|>",
400
+ "lstrip": false,
401
+ "normalized": false,
402
+ "rstrip": false,
403
+ "single_word": false,
404
+ "special": true
405
+ },
406
+ "50": {
407
+ "content": "<|reserved_token_45|>",
408
+ "lstrip": false,
409
+ "normalized": false,
410
+ "rstrip": false,
411
+ "single_word": false,
412
+ "special": true
413
+ },
414
+ "51": {
415
+ "content": "<|reserved_token_46|>",
416
+ "lstrip": false,
417
+ "normalized": false,
418
+ "rstrip": false,
419
+ "single_word": false,
420
+ "special": true
421
+ },
422
+ "52": {
423
+ "content": "<|reserved_token_47|>",
424
+ "lstrip": false,
425
+ "normalized": false,
426
+ "rstrip": false,
427
+ "single_word": false,
428
+ "special": true
429
+ },
430
+ "53": {
431
+ "content": "<|reserved_token_48|>",
432
+ "lstrip": false,
433
+ "normalized": false,
434
+ "rstrip": false,
435
+ "single_word": false,
436
+ "special": true
437
+ },
438
+ "54": {
439
+ "content": "<|reserved_token_49|>",
440
+ "lstrip": false,
441
+ "normalized": false,
442
+ "rstrip": false,
443
+ "single_word": false,
444
+ "special": true
445
+ },
446
+ "55": {
447
+ "content": "<|reserved_token_50|>",
448
+ "lstrip": false,
449
+ "normalized": false,
450
+ "rstrip": false,
451
+ "single_word": false,
452
+ "special": true
453
+ },
454
+ "56": {
455
+ "content": "<|reserved_token_51|>",
456
+ "lstrip": false,
457
+ "normalized": false,
458
+ "rstrip": false,
459
+ "single_word": false,
460
+ "special": true
461
+ },
462
+ "57": {
463
+ "content": "<|reserved_token_52|>",
464
+ "lstrip": false,
465
+ "normalized": false,
466
+ "rstrip": false,
467
+ "single_word": false,
468
+ "special": true
469
+ },
470
+ "58": {
471
+ "content": "<|reserved_token_53|>",
472
+ "lstrip": false,
473
+ "normalized": false,
474
+ "rstrip": false,
475
+ "single_word": false,
476
+ "special": true
477
+ },
478
+ "59": {
479
+ "content": "<|reserved_token_54|>",
480
+ "lstrip": false,
481
+ "normalized": false,
482
+ "rstrip": false,
483
+ "single_word": false,
484
+ "special": true
485
+ },
486
+ "60": {
487
+ "content": "<|reserved_token_55|>",
488
+ "lstrip": false,
489
+ "normalized": false,
490
+ "rstrip": false,
491
+ "single_word": false,
492
+ "special": true
493
+ },
494
+ "61": {
495
+ "content": "<|reserved_token_56|>",
496
+ "lstrip": false,
497
+ "normalized": false,
498
+ "rstrip": false,
499
+ "single_word": false,
500
+ "special": true
501
+ },
502
+ "62": {
503
+ "content": "<|reserved_token_57|>",
504
+ "lstrip": false,
505
+ "normalized": false,
506
+ "rstrip": false,
507
+ "single_word": false,
508
+ "special": true
509
+ },
510
+ "63": {
511
+ "content": "<|reserved_token_58|>",
512
+ "lstrip": false,
513
+ "normalized": false,
514
+ "rstrip": false,
515
+ "single_word": false,
516
+ "special": true
517
+ },
518
+ "64": {
519
+ "content": "<|reserved_token_59|>",
520
+ "lstrip": false,
521
+ "normalized": false,
522
+ "rstrip": false,
523
+ "single_word": false,
524
+ "special": true
525
+ },
526
+ "65": {
527
+ "content": "<|reserved_token_60|>",
528
+ "lstrip": false,
529
+ "normalized": false,
530
+ "rstrip": false,
531
+ "single_word": false,
532
+ "special": true
533
+ },
534
+ "66": {
535
+ "content": "<|reserved_token_61|>",
536
+ "lstrip": false,
537
+ "normalized": false,
538
+ "rstrip": false,
539
+ "single_word": false,
540
+ "special": true
541
+ },
542
+ "67": {
543
+ "content": "<|reserved_token_62|>",
544
+ "lstrip": false,
545
+ "normalized": false,
546
+ "rstrip": false,
547
+ "single_word": false,
548
+ "special": true
549
+ },
550
+ "68": {
551
+ "content": "<|reserved_token_63|>",
552
+ "lstrip": false,
553
+ "normalized": false,
554
+ "rstrip": false,
555
+ "single_word": false,
556
+ "special": true
557
+ },
558
+ "69": {
559
+ "content": "<|reserved_token_64|>",
560
+ "lstrip": false,
561
+ "normalized": false,
562
+ "rstrip": false,
563
+ "single_word": false,
564
+ "special": true
565
+ },
566
+ "70": {
567
+ "content": "<|reserved_token_65|>",
568
+ "lstrip": false,
569
+ "normalized": false,
570
+ "rstrip": false,
571
+ "single_word": false,
572
+ "special": true
573
+ },
574
+ "71": {
575
+ "content": "<|reserved_token_66|>",
576
+ "lstrip": false,
577
+ "normalized": false,
578
+ "rstrip": false,
579
+ "single_word": false,
580
+ "special": true
581
+ },
582
+ "72": {
583
+ "content": "<|reserved_token_67|>",
584
+ "lstrip": false,
585
+ "normalized": false,
586
+ "rstrip": false,
587
+ "single_word": false,
588
+ "special": true
589
+ },
590
+ "73": {
591
+ "content": "<|reserved_token_68|>",
592
+ "lstrip": false,
593
+ "normalized": false,
594
+ "rstrip": false,
595
+ "single_word": false,
596
+ "special": true
597
+ },
598
+ "74": {
599
+ "content": "<|reserved_token_69|>",
600
+ "lstrip": false,
601
+ "normalized": false,
602
+ "rstrip": false,
603
+ "single_word": false,
604
+ "special": true
605
+ },
606
+ "75": {
607
+ "content": "<|reserved_token_70|>",
608
+ "lstrip": false,
609
+ "normalized": false,
610
+ "rstrip": false,
611
+ "single_word": false,
612
+ "special": true
613
+ },
614
+ "76": {
615
+ "content": "<|reserved_token_71|>",
616
+ "lstrip": false,
617
+ "normalized": false,
618
+ "rstrip": false,
619
+ "single_word": false,
620
+ "special": true
621
+ },
622
+ "77": {
623
+ "content": "<|reserved_token_72|>",
624
+ "lstrip": false,
625
+ "normalized": false,
626
+ "rstrip": false,
627
+ "single_word": false,
628
+ "special": true
629
+ },
630
+ "78": {
631
+ "content": "<|reserved_token_73|>",
632
+ "lstrip": false,
633
+ "normalized": false,
634
+ "rstrip": false,
635
+ "single_word": false,
636
+ "special": true
637
+ },
638
+ "79": {
639
+ "content": "<|reserved_token_74|>",
640
+ "lstrip": false,
641
+ "normalized": false,
642
+ "rstrip": false,
643
+ "single_word": false,
644
+ "special": true
645
+ },
646
+ "80": {
647
+ "content": "<|reserved_token_75|>",
648
+ "lstrip": false,
649
+ "normalized": false,
650
+ "rstrip": false,
651
+ "single_word": false,
652
+ "special": true
653
+ },
654
+ "81": {
655
+ "content": "<|reserved_token_76|>",
656
+ "lstrip": false,
657
+ "normalized": false,
658
+ "rstrip": false,
659
+ "single_word": false,
660
+ "special": true
661
+ },
662
+ "82": {
663
+ "content": "<|reserved_token_77|>",
664
+ "lstrip": false,
665
+ "normalized": false,
666
+ "rstrip": false,
667
+ "single_word": false,
668
+ "special": true
669
+ },
670
+ "83": {
671
+ "content": "<|reserved_token_78|>",
672
+ "lstrip": false,
673
+ "normalized": false,
674
+ "rstrip": false,
675
+ "single_word": false,
676
+ "special": true
677
+ },
678
+ "84": {
679
+ "content": "<|reserved_token_79|>",
680
+ "lstrip": false,
681
+ "normalized": false,
682
+ "rstrip": false,
683
+ "single_word": false,
684
+ "special": true
685
+ },
686
+ "85": {
687
+ "content": "<|reserved_token_80|>",
688
+ "lstrip": false,
689
+ "normalized": false,
690
+ "rstrip": false,
691
+ "single_word": false,
692
+ "special": true
693
+ },
694
+ "86": {
695
+ "content": "<|reserved_token_81|>",
696
+ "lstrip": false,
697
+ "normalized": false,
698
+ "rstrip": false,
699
+ "single_word": false,
700
+ "special": true
701
+ },
702
+ "87": {
703
+ "content": "<|reserved_token_82|>",
704
+ "lstrip": false,
705
+ "normalized": false,
706
+ "rstrip": false,
707
+ "single_word": false,
708
+ "special": true
709
+ },
710
+ "88": {
711
+ "content": "<|reserved_token_83|>",
712
+ "lstrip": false,
713
+ "normalized": false,
714
+ "rstrip": false,
715
+ "single_word": false,
716
+ "special": true
717
+ },
718
+ "89": {
719
+ "content": "<|reserved_token_84|>",
720
+ "lstrip": false,
721
+ "normalized": false,
722
+ "rstrip": false,
723
+ "single_word": false,
724
+ "special": true
725
+ },
726
+ "90": {
727
+ "content": "<|reserved_token_85|>",
728
+ "lstrip": false,
729
+ "normalized": false,
730
+ "rstrip": false,
731
+ "single_word": false,
732
+ "special": true
733
+ },
734
+ "91": {
735
+ "content": "<|reserved_token_86|>",
736
+ "lstrip": false,
737
+ "normalized": false,
738
+ "rstrip": false,
739
+ "single_word": false,
740
+ "special": true
741
+ },
742
+ "92": {
743
+ "content": "<|reserved_token_87|>",
744
+ "lstrip": false,
745
+ "normalized": false,
746
+ "rstrip": false,
747
+ "single_word": false,
748
+ "special": true
749
+ },
750
+ "93": {
751
+ "content": "<|reserved_token_88|>",
752
+ "lstrip": false,
753
+ "normalized": false,
754
+ "rstrip": false,
755
+ "single_word": false,
756
+ "special": true
757
+ },
758
+ "94": {
759
+ "content": "<|reserved_token_89|>",
760
+ "lstrip": false,
761
+ "normalized": false,
762
+ "rstrip": false,
763
+ "single_word": false,
764
+ "special": true
765
+ },
766
+ "95": {
767
+ "content": "<|reserved_token_90|>",
768
+ "lstrip": false,
769
+ "normalized": false,
770
+ "rstrip": false,
771
+ "single_word": false,
772
+ "special": true
773
+ },
774
+ "96": {
775
+ "content": "<|reserved_token_91|>",
776
+ "lstrip": false,
777
+ "normalized": false,
778
+ "rstrip": false,
779
+ "single_word": false,
780
+ "special": true
781
+ },
782
+ "97": {
783
+ "content": "<|reserved_token_92|>",
784
+ "lstrip": false,
785
+ "normalized": false,
786
+ "rstrip": false,
787
+ "single_word": false,
788
+ "special": true
789
+ },
790
+ "98": {
791
+ "content": "<|reserved_token_93|>",
792
+ "lstrip": false,
793
+ "normalized": false,
794
+ "rstrip": false,
795
+ "single_word": false,
796
+ "special": true
797
+ },
798
+ "99": {
799
+ "content": "<|reserved_token_94|>",
800
+ "lstrip": false,
801
+ "normalized": false,
802
+ "rstrip": false,
803
+ "single_word": false,
804
+ "special": true
805
+ },
806
+ "100": {
807
+ "content": "<|reserved_token_95|>",
808
+ "lstrip": false,
809
+ "normalized": false,
810
+ "rstrip": false,
811
+ "single_word": false,
812
+ "special": true
813
+ },
814
+ "101": {
815
+ "content": "<|reserved_token_96|>",
816
+ "lstrip": false,
817
+ "normalized": false,
818
+ "rstrip": false,
819
+ "single_word": false,
820
+ "special": true
821
+ },
822
+ "102": {
823
+ "content": "<|reserved_token_97|>",
824
+ "lstrip": false,
825
+ "normalized": false,
826
+ "rstrip": false,
827
+ "single_word": false,
828
+ "special": true
829
+ },
830
+ "103": {
831
+ "content": "<mask>",
832
+ "lstrip": false,
833
+ "normalized": false,
834
+ "rstrip": false,
835
+ "single_word": false,
836
+ "special": false
837
+ },
838
+ "104": {
839
+ "content": "\\r",
840
+ "lstrip": false,
841
+ "normalized": false,
842
+ "rstrip": false,
843
+ "single_word": false,
844
+ "special": false
845
+ },
846
+ "105": {
847
+ "content": "▁▁",
848
+ "lstrip": false,
849
+ "normalized": false,
850
+ "rstrip": false,
851
+ "single_word": false,
852
+ "special": false
853
+ },
854
+ "106": {
855
+ "content": "▁▁▁",
856
+ "lstrip": false,
857
+ "normalized": false,
858
+ "rstrip": false,
859
+ "single_word": false,
860
+ "special": false
861
+ },
862
+ "107": {
863
+ "content": "▁▁▁▁",
864
+ "lstrip": false,
865
+ "normalized": false,
866
+ "rstrip": false,
867
+ "single_word": false,
868
+ "special": false
869
+ },
870
+ "108": {
871
+ "content": "▁▁▁▁▁",
872
+ "lstrip": false,
873
+ "normalized": false,
874
+ "rstrip": false,
875
+ "single_word": false,
876
+ "special": false
877
+ },
878
+ "109": {
879
+ "content": "▁▁▁▁▁▁",
880
+ "lstrip": false,
881
+ "normalized": false,
882
+ "rstrip": false,
883
+ "single_word": false,
884
+ "special": false
885
+ },
886
+ "110": {
887
+ "content": "▁▁▁▁▁▁▁",
888
+ "lstrip": false,
889
+ "normalized": false,
890
+ "rstrip": false,
891
+ "single_word": false,
892
+ "special": false
893
+ },
894
+ "111": {
895
+ "content": "▁▁▁▁▁▁▁▁",
896
+ "lstrip": false,
897
+ "normalized": false,
898
+ "rstrip": false,
899
+ "single_word": false,
900
+ "special": false
901
+ },
902
+ "112": {
903
+ "content": "▁▁▁▁▁▁▁▁▁",
904
+ "lstrip": false,
905
+ "normalized": false,
906
+ "rstrip": false,
907
+ "single_word": false,
908
+ "special": false
909
+ },
910
+ "113": {
911
+ "content": "▁▁▁▁▁▁▁▁▁▁",
912
+ "lstrip": false,
913
+ "normalized": false,
914
+ "rstrip": false,
915
+ "single_word": false,
916
+ "special": false
917
+ },
918
+ "114": {
919
+ "content": "▁▁▁▁▁▁▁▁▁▁▁",
920
+ "lstrip": false,
921
+ "normalized": false,
922
+ "rstrip": false,
923
+ "single_word": false,
924
+ "special": false
925
+ },
926
+ "115": {
927
+ "content": "▁▁▁▁▁▁▁▁▁▁▁▁",
928
+ "lstrip": false,
929
+ "normalized": false,
930
+ "rstrip": false,
931
+ "single_word": false,
932
+ "special": false
933
+ },
934
+ "116": {
935
+ "content": "▁▁▁▁▁▁▁▁▁▁▁▁▁",
936
+ "lstrip": false,
937
+ "normalized": false,
938
+ "rstrip": false,
939
+ "single_word": false,
940
+ "special": false
941
+ },
942
+ "117": {
943
+ "content": "▁▁▁▁▁▁▁▁▁▁▁▁▁▁",
944
+ "lstrip": false,
945
+ "normalized": false,
946
+ "rstrip": false,
947
+ "single_word": false,
948
+ "special": false
949
+ },
950
+ "118": {
951
+ "content": "▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁",
952
+ "lstrip": false,
953
+ "normalized": false,
954
+ "rstrip": false,
955
+ "single_word": false,
956
+ "special": false
957
+ },
958
+ "119": {
959
+ "content": "▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁",
960
+ "lstrip": false,
961
+ "normalized": false,
962
+ "rstrip": false,
963
+ "single_word": false,
964
+ "special": false
965
+ },
966
+ "120": {
967
+ "content": "▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁",
968
+ "lstrip": false,
969
+ "normalized": false,
970
+ "rstrip": false,
971
+ "single_word": false,
972
+ "special": false
973
+ },
974
+ "121": {
975
+ "content": "▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁",
976
+ "lstrip": false,
977
+ "normalized": false,
978
+ "rstrip": false,
979
+ "single_word": false,
980
+ "special": false
981
+ },
982
+ "122": {
983
+ "content": "▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁",
984
+ "lstrip": false,
985
+ "normalized": false,
986
+ "rstrip": false,
987
+ "single_word": false,
988
+ "special": false
989
+ },
990
+ "123": {
991
+ "content": "▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁",
992
+ "lstrip": false,
993
+ "normalized": false,
994
+ "rstrip": false,
995
+ "single_word": false,
996
+ "special": false
997
+ },
998
+ "124": {
999
+ "content": "▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁",
1000
+ "lstrip": false,
1001
+ "normalized": false,
1002
+ "rstrip": false,
1003
+ "single_word": false,
1004
+ "special": false
1005
+ },
1006
+ "125": {
1007
+ "content": "▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁",
1008
+ "lstrip": false,
1009
+ "normalized": false,
1010
+ "rstrip": false,
1011
+ "single_word": false,
1012
+ "special": false
1013
+ },
1014
+ "126": {
1015
+ "content": "▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁",
1016
+ "lstrip": false,
1017
+ "normalized": false,
1018
+ "rstrip": false,
1019
+ "single_word": false,
1020
+ "special": false
1021
+ },
1022
+ "127": {
1023
+ "content": "▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁",
1024
+ "lstrip": false,
1025
+ "normalized": false,
1026
+ "rstrip": false,
1027
+ "single_word": false,
1028
+ "special": false
1029
+ },
1030
+ "128": {
1031
+ "content": "\t\t",
1032
+ "lstrip": false,
1033
+ "normalized": false,
1034
+ "rstrip": false,
1035
+ "single_word": false,
1036
+ "special": false
1037
+ },
1038
+ "129": {
1039
+ "content": "\t\t\t",
1040
+ "lstrip": false,
1041
+ "normalized": false,
1042
+ "rstrip": false,
1043
+ "single_word": false,
1044
+ "special": false
1045
+ },
1046
+ "130": {
1047
+ "content": "\t\t\t\t",
1048
+ "lstrip": false,
1049
+ "normalized": false,
1050
+ "rstrip": false,
1051
+ "single_word": false,
1052
+ "special": false
1053
+ },
1054
+ "131": {
1055
+ "content": "\t\t\t\t\t",
1056
+ "lstrip": false,
1057
+ "normalized": false,
1058
+ "rstrip": false,
1059
+ "single_word": false,
1060
+ "special": false
1061
+ },
1062
+ "132": {
1063
+ "content": "\t\t\t\t\t\t",
1064
+ "lstrip": false,
1065
+ "normalized": false,
1066
+ "rstrip": false,
1067
+ "single_word": false,
1068
+ "special": false
1069
+ },
1070
+ "133": {
1071
+ "content": "\n\n",
1072
+ "lstrip": false,
1073
+ "normalized": false,
1074
+ "rstrip": false,
1075
+ "single_word": false,
1076
+ "special": false
1077
+ },
1078
+ "134": {
1079
+ "content": "\n\n\n",
1080
+ "lstrip": false,
1081
+ "normalized": false,
1082
+ "rstrip": false,
1083
+ "single_word": false,
1084
+ "special": false
1085
+ }
1086
+ },
1087
+ "bos_token": "<s>",
1088
+ "clean_up_tokenization_spaces": false,
1089
+ "eos_token": "</s>",
1090
+ "legacy": true,
1091
+ "model_max_length": 512,
1092
+ "pad_token": "<pad>",
1093
+ "sp_model_kwargs": {},
1094
+ "spaces_between_special_tokens": false,
1095
+ "tokenizer_class": "LlamaTokenizer",
1096
+ "unk_token": "<unk>",
1097
+ "use_default_system_prompt": false
1098
+ }