---
language: 
- multilingual
- bg
- ca
- code
- cs
- cy
- da
- de
- el
- en
- es
- et
- eu
- fi
- fr
- ga
- gl
- hr
- hu
- it
- lt
- lv
- mt
- nl
- nn
- "no"
- oc
- pl
- pt
- ro
- ru
- sh
- sk
- sl
- sr
- sv
- uk
tags: 
  - roberta
  - mroberta
  - fill-mask
license: apache-2.0
library_name: transformers
---


# mRoBERTa Model Card

mRoBERTa is a new **multilingual foundational model** based on the [RoBERTa](https://huggingface.co/FacebookAI/roberta-base) architecture, pretrained from scratch on 35 European languages and code. The pretraining corpus consists of **12.8TB of high-quality data**, significantly more than the 2.5TB of multilingual data used to train previous state-of-the-art encoder-only foundational models such as [XLM-RoBERTa-base](https://huggingface.co/FacebookAI/xlm-roberta-base) and [XLM-RoBERTa-large](https://huggingface.co/FacebookAI/xlm-roberta-large).
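
For quick experimentation, the model can be loaded for masked-token prediction with the `transformers` library. A minimal sketch, assuming the checkpoint published at [BSC-LT/mRoBERTa](https://huggingface.co/BSC-LT/mRoBERTa) and the `<mask>` token used by other RoBERTa-family models:

```python
from transformers import pipeline

# Masked-token prediction with the pretrained encoder.
unmasker = pipeline("fill-mask", model="BSC-LT/mRoBERTa")

# The pipeline returns the top candidates for the masked position.
for prediction in unmasker("Barcelona is the capital of <mask>."):
    print(f"{prediction['token_str']!r}: {prediction['score']:.3f}")
```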

## Technical Description

Technical details of the mRoBERTa model.

| Description          | Value         |
|-------------------------|:--------------|
| Model Parameters        | 283M          |
| Tokenizer Type          | SPM           |
| Vocabulary size         | 256,000       |
| Precision               | bfloat16      |
| Context length          | 512           |
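
These figures can be sanity-checked directly from the published checkpoint; a sketch (the position-embedding count may include a few reserved special positions, as is usual for RoBERTa-style models):

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BSC-LT/mRoBERTa")
model = AutoModelForMaskedLM.from_pretrained("BSC-LT/mRoBERTa")

print(tokenizer.vocab_size)                        # expected: 256000
print(model.config.max_position_embeddings)        # context length (plus reserved positions)
print(sum(p.numel() for p in model.parameters()))  # roughly 283M parameters
```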


Training Hyperparameters

| Hyperparameter          | Value                             |
|-------------------------|:----------------------------------|
| Pretraining Objective   | Masked Language Modeling          |
| Learning Rate           | 7E-05                             |
| Learning Rate Scheduler | Cosine                            |
| Warmup Steps            | 10k                               |
| Optimizer               | AdamW (β1=0.9, β2=0.98, ε=1e-06)  |
| Weight Decay            | 1E-02                             |
| Global Batch Size       | 8192                              |
| Dropout                 | 1E-01                             |
| Attention Dropout       | 1E-01                             |
| Activation Function     | GeLU                              |
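
Purely as an illustration of how these values map onto a standard training setup (a sketch with a stand-in module, not the actual pretraining code; the total step count is an assumption):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(768, 768)  # stand-in for the actual encoder

# AdamW configured with the hyperparameters from the table above.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=7e-5, betas=(0.9, 0.98), eps=1e-6, weight_decay=1e-2
)

# Cosine decay with a 10k-step warmup; num_training_steps is illustrative.
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=10_000, num_training_steps=1_000_000
)
```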







## Data

### Pretraining Corpus

The training corpus consists of **35 European languages** and 92 programming languages, amounting to a total of **12.8TB** of high-quality data.

This highly multilingual corpus is predominantly composed of data from Colossal OSCAR, which contributes 66.06% of the total tokens.
Starcoder provides a further 11.91%, followed by Spanish Crawling (3.34%), French PD (3.12%), and Proof Pile (1.98%).
Other notable sources include MaCoCu, Pile of Law, and EurLex, each contributing between roughly 1.3% and 1.5%.
Together these major sources form the bulk of the corpus, providing a rich and diverse dataset for pretraining;
the remaining roughly 10% comes from smaller sources in a variety of languages.



The final pretraining distribution, split by language, is shown in the following figure:
<img src="./images/roberta_pretraining_lang_distribution.png" alt="drawing"/>

<small>Further details about the pretraining corpus can be found [here](https://huggingface.co/BSC-LT/ALIA-40b), as it is the same corpus used for the Salamandra family of foundational models.</small>


## Multilingual Evaluation and Performance

The models are evaluated on multilingual benchmarks to assess their cross-lingual capabilities.

The following multilingual benchmarks have been considered:

| Benchmark        | Description | Languages |       Source |   
|------------------|-------------|-----------|--------------|
| XTREME|  Benchmark for the evaluation of the cross-lingual generalization ability of pre-trained multilingual models | bg,ca,de,el,en,es,et,eu,fi,fr,hu,it,lt,nl,pl,pt,ro,ru,uk | [LINK](https://github.com/google-research/xtreme) |
| CLUB | Human-Annotated Catalan Benchmark  | ca | [LINK](https://club.aina.bsc.es/datasets.html) |
| Basque Custom Benchmark |  A set of NER, POS and TC evaluation tasks to assess performance in Basque.  | eu | [LINK](https://huggingface.co/datasets/orai-nlp/basqueGLUE) |
| Galician Custom Benchmark |  NER and POS evaluation tasks to assess performance in Galician. | gl | [LINK](https://github.com/xavier-gz/SLI_Galician_Corpora/blob/main/SLI_NERC_Galician_Gold_CoNLL.1.0.tar.gz) [LINK](https://huggingface.co/datasets/universal-dependencies/universal_dependencies)|


The following base foundational models have been considered:



 
| Multilingual Foundational Model | Number of Parameters | Vocab Size | Description |
|---------------------------------|----------------------|------------|-------------|
| [BERTa](https://huggingface.co/PlanTL-GOB-ES/roberta-base-ca)                      |   126M              |     52K       |     BERTa is a Catalan-specific language model pretrained with Catalan-only data.        |
| [BERTinho](https://huggingface.co/dvilares/bertinho-gl-base-cased)                      |   109M              |     30K       |    BERTinho is a monolingual BERT model for Galician.          |
| [mBERT](https://huggingface.co/google-bert/bert-base-multilingual-cased)                           | 178M                | 120K       |    Multilingual BERT model pretrained on the top 104 languages with the largest Wikipedia.         |
| [mRoBERTa](https://huggingface.co/BSC-LT/mRoBERTa)                       | 283M                | 256K       |     RoBERTa base model pretrained on 35 European languages with a larger vocabulary.         |
| [roberta-base-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne)                | 125M                | 50K        |  RoBERTa base model pretrained on 570GB of data from web crawls performed by the National Library of Spain between 2009 and 2019.            |
| [RoBERTa-ca](https://huggingface.co/BSC-LT/RoBERTa-ca)                      | 125M                |   50K        |    RoBERTa-ca is a Catalan-specific language model obtained by using vocabulary adaptation from mRoBERTa.        |
| [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base)                | 279M                | 250K       |   Foundational RoBERTa model pretrained with CommonCrawl data containing 100 languages.          | 
| [xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large)               | 561M                | 250K       | Foundational RoBERTa model pretrained with CommonCrawl data containing 100 languages.            |



## Results

This section presents results across the multilingual benchmarks, with the best score per row highlighted in bold and the second-best underlined.

### XTREME Benchmark

The Cross-lingual TRansfer Evaluation of Multilingual Encoders ([XTREME](https://github.com/google-research/xtreme)) benchmark evaluates the cross-lingual generalization ability of pretrained multilingual models. It includes nine tasks that collectively require reasoning about different levels of syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks, availability of training data, and overlap with the languages seen during model pretraining.
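
For classification tasks such as XNLI, evaluation follows the usual encoder recipe: fine-tune a classification head on the English training split, then measure transfer to the other languages. A minimal sketch, assuming the Hugging Face `xnli` dataset and its `premise`/`hypothesis` columns:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("BSC-LT/mRoBERTa")
model = AutoModelForSequenceClassification.from_pretrained(
    "BSC-LT/mRoBERTa", num_labels=3  # entailment / neutral / contradiction
)

# Fine-tune on English; cross-lingual transfer is then scored per language.
train = load_dataset("xnli", "en", split="train")

def encode(batch):
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, max_length=512)

train = train.map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="xnli-en", per_device_train_batch_size=32),
    train_dataset=train,
    tokenizer=tokenizer,
)
trainer.train()
```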

#### 🔵 Sentence Classification

##### 🔵 XNLI
Metric used: Accuracy.

<table>
<tr><th>langs</th><th>mBERT (178M)</th><th>xlm-roberta-base (279M)</th><th>xlm-roberta-large (561M)</th><th>mRoBERTa (283M)</th></tr>
<tr><td>bg</td><td style=''>69.34</td><td style='text-decoration: underline;'>78.26</td><td style='font-weight: bold;'>82.10</td><td style=''>77.56</td></tr>
<tr><td>de</td><td style=''>71.54</td><td style=''>76.75</td><td style='font-weight: bold;'>81.62</td><td style='text-decoration: underline;'>77.01</td></tr>
<tr><td>el</td><td style=''>66.51</td><td style='text-decoration: underline;'>76.37</td><td style='font-weight: bold;'>81.46</td><td style=''>76.35</td></tr>
<tr><td>en</td><td style=''>82.20</td><td style=''>84.45</td><td style='font-weight: bold;'>87.98</td><td style='text-decoration: underline;'>85.69</td></tr>
<tr><td>es</td><td style=''>74.81</td><td style=''>78.18</td><td style='font-weight: bold;'>83.65</td><td style='text-decoration: underline;'>79.66</td></tr>
<tr><td>fr</td><td style=''>74.25</td><td style=''>78.24</td><td style='font-weight: bold;'>82.71</td><td style='text-decoration: underline;'>79.16</td></tr>
<tr><td>ru</td><td style=''>68.56</td><td style='text-decoration: underline;'>76.21</td><td style='font-weight: bold;'>79.10</td><td style=''>74.73</td></tr>
</table>

##### 🔵  PAWS-X
Metric used: Accuracy.

<table>
<tr><th>langs</th><th>mBERT (178M)</th><th>xlm-roberta-base (279M)</th><th>xlm-roberta-large (561M)</th><th>mRoBERTa (283M)</th></tr>
<tr><td>de</td><td style=''>85.65</td><td style='text-decoration: underline;'>86.95</td><td style=''>85.05</td><td style='font-weight: bold;'>87.35</td></tr>
<tr><td>en</td><td style=''>93.50</td><td style='text-decoration: underline;'>93.90</td><td style=''>91.45</td><td style='font-weight: bold;'>94.75</td></tr>
<tr><td>es</td><td style=''>87.75</td><td style='font-weight: bold;'>89.30</td><td style=''>87.65</td><td style='text-decoration: underline;'>88.60</td></tr>
<tr><td>fr</td><td style=''>86.60</td><td style='text-decoration: underline;'>88.55</td><td style=''>87.30</td><td style='font-weight: bold;'>89.20</td></tr>
</table>



#### 🟣 Structured Prediction

##### 🟣  POS (UDPOS)
Metric used: F1.

<table>
<tr><th>langs</th><th>mBERT (178M)</th><th>xlm-roberta-base (279M)</th><th>xlm-roberta-large (561M)</th><th>mRoBERTa (283M)</th></tr>
<tr><td>bg</td><td style=''>85.14</td><td style='text-decoration: underline;'>88.62</td><td style='font-weight: bold;'>89.06</td><td style=''>88.19</td></tr>
<tr><td>de</td><td style=''>85.71</td><td style=''>88.41</td><td style='font-weight: bold;'>88.65</td><td style='text-decoration: underline;'>88.58</td></tr>
<tr><td>el</td><td style=''>80.92</td><td style='font-weight: bold;'>87.12</td><td style=''>86.55</td><td style='text-decoration: underline;'>87.03</td></tr>
<tr><td>en</td><td style=''>95.43</td><td style=''>95.79</td><td style='font-weight: bold;'>96.07</td><td style='text-decoration: underline;'>95.85</td></tr>
<tr><td>es</td><td style=''>85.85</td><td style='text-decoration: underline;'>88.10</td><td style='font-weight: bold;'>89.31</td><td style=''>87.45</td></tr>
<tr><td>et</td><td style=''>79.68</td><td style=''>86.22</td><td style='font-weight: bold;'>87.36</td><td style='text-decoration: underline;'>86.25</td></tr>
<tr><td>eu</td><td style=''>60.18</td><td style=''>68.83</td><td style='font-weight: bold;'>71.85</td><td style='text-decoration: underline;'>69.22</td></tr>
<tr><td>fi</td><td style=''>79.72</td><td style='text-decoration: underline;'>85.90</td><td style='font-weight: bold;'>86.54</td><td style=''>84.23</td></tr>
<tr><td>fr</td><td style=''>81.20</td><td style=''>86.34</td><td style='font-weight: bold;'>88.24</td><td style='text-decoration: underline;'>87.00</td></tr>
<tr><td>hu</td><td style=''>78.39</td><td style='text-decoration: underline;'>83.05</td><td style='font-weight: bold;'>83.84</td><td style=''>82.96</td></tr>
<tr><td>it</td><td style=''>87.86</td><td style=''>88.91</td><td style='font-weight: bold;'>90.01</td><td style='text-decoration: underline;'>89.11</td></tr>
<tr><td>lt</td><td style=''>78.59</td><td style='text-decoration: underline;'>83.86</td><td style='font-weight: bold;'>84.91</td><td style=''>81.12</td></tr>
<tr><td>nl</td><td style=''>88.59</td><td style=''>89.16</td><td style='font-weight: bold;'>89.70</td><td style='text-decoration: underline;'>89.31</td></tr>
<tr><td>pl</td><td style=''>80.34</td><td style='text-decoration: underline;'>84.61</td><td style='font-weight: bold;'>85.77</td><td style=''>84.23</td></tr>
<tr><td>pt</td><td style=''>85.77</td><td style='text-decoration: underline;'>87.53</td><td style='font-weight: bold;'>88.56</td><td style=''>87.18</td></tr>
<tr><td>ro</td><td style=''>76.51</td><td style='text-decoration: underline;'>83.99</td><td style='font-weight: bold;'>86.47</td><td style=''>82.74</td></tr>
<tr><td>ru</td><td style=''>85.36</td><td style=''>88.75</td><td style='font-weight: bold;'>89.83</td><td style='text-decoration: underline;'>89.09</td></tr>
<tr><td>uk</td><td style=''>80.63</td><td style=''>84.79</td><td style='font-weight: bold;'>85.84</td><td style='text-decoration: underline;'>85.19</td></tr>
</table>


##### 🟣  NER (PANX)
Metric used: F1.

<table>
<tr><th>langs</th><th>mBERT (178M)</th><th>xlm-roberta-base (279M)</th><th>xlm-roberta-large (561M)</th><th>mRoBERTa (283M)</th></tr>
<tr><td>bg</td><td style=''>78.38</td><td style=''>76.52</td><td style='font-weight: bold;'>81.97</td><td style='text-decoration: underline;'>78.66</td></tr>
<tr><td>de</td><td style='font-weight: bold;'>78.89</td><td style=''>73.92</td><td style='text-decoration: underline;'>78.59</td><td style=''>78.17</td></tr>
<tr><td>el</td><td style=''>74.09</td><td style=''>73.07</td><td style='font-weight: bold;'>75.49</td><td style='text-decoration: underline;'>74.81</td></tr>
<tr><td>en</td><td style='font-weight: bold;'>84.69</td><td style=''>82.70</td><td style='text-decoration: underline;'>84.50</td><td style=''>83.56</td></tr>
<tr><td>es</td><td style=''>72.32</td><td style=''>72.83</td><td style='text-decoration: underline;'>73.46</td><td style='font-weight: bold;'>78.30</td></tr>
<tr><td>et</td><td style='text-decoration: underline;'>77.55</td><td style=''>72.56</td><td style='font-weight: bold;'>78.37</td><td style=''>73.92</td></tr>
<tr><td>eu</td><td style='font-weight: bold;'>66.52</td><td style=''>58.34</td><td style='text-decoration: underline;'>60.01</td><td style=''>56.74</td></tr>
<tr><td>fi</td><td style='text-decoration: underline;'>78.11</td><td style=''>74.98</td><td style='font-weight: bold;'>78.46</td><td style=''>76.42</td></tr>
<tr><td>fr</td><td style='text-decoration: underline;'>79.45</td><td style=''>77.00</td><td style='font-weight: bold;'>80.16</td><td style=''>76.94</td></tr>
<tr><td>hu</td><td style='text-decoration: underline;'>77.39</td><td style=''>75.48</td><td style='font-weight: bold;'>80.10</td><td style=''>73.31</td></tr>
<tr><td>it</td><td style='font-weight: bold;'>81.33</td><td style=''>76.68</td><td style='text-decoration: underline;'>80.60</td><td style=''>80.04</td></tr>
<tr><td>lt</td><td style='text-decoration: underline;'>75.48</td><td style=''>73.76</td><td style='font-weight: bold;'>76.41</td><td style=''>72.71</td></tr>
<tr><td>nl</td><td style='text-decoration: underline;'>82.40</td><td style=''>79.80</td><td style='font-weight: bold;'>82.92</td><td style=''>81.42</td></tr>
<tr><td>pl</td><td style='font-weight: bold;'>80.57</td><td style=''>77.15</td><td style='text-decoration: underline;'>80.55</td><td style=''>80.26</td></tr>
<tr><td>pt</td><td style='text-decoration: underline;'>79.66</td><td style=''>76.60</td><td style='font-weight: bold;'>80.97</td><td style=''>76.13</td></tr>
<tr><td>ro</td><td style='text-decoration: underline;'>74.73</td><td style=''>71.79</td><td style='font-weight: bold;'>81.42</td><td style=''>66.85</td></tr>
<tr><td>ru</td><td style=''>65.42</td><td style=''>63.93</td><td style='font-weight: bold;'>70.68</td><td style='text-decoration: underline;'>67.53</td></tr>
<tr><td>uk</td><td style='text-decoration: underline;'>71.71</td><td style=''>66.78</td><td style='font-weight: bold;'>74.12</td><td style=''>71.69</td></tr>
</table>

#### ⚪️ Sentence Retrieval

##### ⚪️  BUCC2018
Metric used: F1.

<table>
<tr><th>langs</th><th>mBERT (178M)</th><th>xlm-roberta-base (279M)</th><th>xlm-roberta-large (561M)</th><th>mRoBERTa (283M)</th></tr>
<tr><td>de</td><td style=''>63.26</td><td style=''>66.83</td><td style='text-decoration: underline;'>75.23</td><td style='font-weight: bold;'>86.09</td></tr>
<tr><td>fr</td><td style=''>62.62</td><td style=''>65.79</td><td style='text-decoration: underline;'>69.29</td><td style='font-weight: bold;'>79.21</td></tr>
<tr><td>ru</td><td style=''>54.97</td><td style=''>70.12</td><td style='text-decoration: underline;'>75.57</td><td style='font-weight: bold;'>82.93</td></tr>
</table>

##### ⚪️  Tatoeba
Metric used: Accuracy.

<table>
<tr><th>langs</th><th>mBERT (178M)</th><th>xlm-roberta-base (279M)</th><th>xlm-roberta-large (561M)</th><th>mRoBERTa (283M)</th></tr>
<tr><td>bg</td><td style=''>48.80</td><td style=''>66.90</td><td style='text-decoration: underline;'>71.60</td><td style='font-weight: bold;'>77.60</td></tr>
<tr><td>ca</td><td style=''>59.80</td><td style=''>57.30</td><td style='text-decoration: underline;'>62.20</td><td style='font-weight: bold;'>80.20</td></tr>
<tr><td>de</td><td style=''>75.40</td><td style=''>88.40</td><td style='text-decoration: underline;'>88.80</td><td style='font-weight: bold;'>95.60</td></tr>
<tr><td>el</td><td style=''>29.80</td><td style=''>51.60</td><td style='text-decoration: underline;'>61.80</td><td style='font-weight: bold;'>72.30</td></tr>
<tr><td>es</td><td style=''>64.10</td><td style=''>71.00</td><td style='text-decoration: underline;'>75.70</td><td style='font-weight: bold;'>89.70</td></tr>
<tr><td>et</td><td style=''>28.10</td><td style=''>44.20</td><td style='text-decoration: underline;'>52.20</td><td style='font-weight: bold;'>61.80</td></tr>
<tr><td>eu</td><td style=''>25.50</td><td style=''>26.10</td><td style='text-decoration: underline;'>35.80</td><td style='font-weight: bold;'>53.40</td></tr>
<tr><td>fi</td><td style=''>39.00</td><td style='text-decoration: underline;'>63.90</td><td style='font-weight: bold;'>71.60</td><td style='text-decoration: underline;'>63.90</td></tr>
<tr><td>fr</td><td style=''>64.30</td><td style=''>72.50</td><td style='text-decoration: underline;'>73.70</td><td style='font-weight: bold;'>81.30</td></tr>
<tr><td>hu</td><td style=''>36.90</td><td style=''>58.70</td><td style='font-weight: bold;'>65.40</td><td style='text-decoration: underline;'>62.40</td></tr>
<tr><td>it</td><td style=''>57.30</td><td style=''>64.70</td><td style='text-decoration: underline;'>68.30</td><td style='font-weight: bold;'>80.30</td></tr>
<tr><td>lt</td><td style=''>31.10</td><td style='text-decoration: underline;'>54.80</td><td style='font-weight: bold;'>59.60</td><td style=''>49.30</td></tr>
<tr><td>nl</td><td style=''>63.70</td><td style=''>76.80</td><td style='text-decoration: underline;'>80.80</td><td style='font-weight: bold;'>86.60</td></tr>
<tr><td>pl</td><td style=''>50.10</td><td style=''>65.20</td><td style='text-decoration: underline;'>75.90</td><td style='font-weight: bold;'>79.00</td></tr>
<tr><td>pt</td><td style=''>68.40</td><td style=''>76.60</td><td style='text-decoration: underline;'>82.20</td><td style='font-weight: bold;'>88.80</td></tr>
<tr><td>ro</td><td style=''>51.50</td><td style=''>68.80</td><td style='font-weight: bold;'>75.70</td><td style='text-decoration: underline;'>69.00</td></tr>
<tr><td>ru</td><td style=''>59.40</td><td style=''>69.80</td><td style='text-decoration: underline;'>74.10</td><td style='font-weight: bold;'>81.60</td></tr>
<tr><td>uk</td><td style=''>52.60</td><td style=''>57.30</td><td style='text-decoration: underline;'>69.10</td><td style='font-weight: bold;'>77.50</td></tr>
</table>






#### ⚫ Question Answering

##### ⚫ XQUAD
Metric used: F1.

<table>
<tr><th>langs</th><th>mBERT (178M)</th><th>xlm-roberta-base (279M)</th><th>xlm-roberta-large (561M)</th><th>mRoBERTa (283M)</th></tr>
<tr><td>de</td><td style=''>73.55</td><td style='text-decoration: underline;'>74.81</td><td style='font-weight: bold;'>80.15</td><td style=''>73.92</td></tr>
<tr><td>el</td><td style=''>63.74</td><td style=''>73.34</td><td style='font-weight: bold;'>80.86</td><td style='text-decoration: underline;'>73.56</td></tr>
<tr><td>en</td><td style='text-decoration: underline;'>84.84</td><td style=''>84.22</td><td style='font-weight: bold;'>88.13</td><td style=''>82.70</td></tr>
<tr><td>es</td><td style=''>75.06</td><td style=''>76.44</td><td style='font-weight: bold;'>82.21</td><td style='text-decoration: underline;'>77.07</td></tr>
<tr><td>ru</td><td style=''>72.02</td><td style='text-decoration: underline;'>74.73</td><td style='font-weight: bold;'>80.11</td><td style=''>72.85</td></tr>
</table>


##### ⚫ MLQA
Metric used: F1.

<table>
<tr><th>langs</th><th>mBERT (178M)</th><th>xlm-roberta-base (279M)</th><th>xlm-roberta-large (561M)</th><th>mRoBERTa (283M)</th></tr>
<tr><td>de</td><td style=''>57.68</td><td style=''>62.20</td><td style='font-weight: bold;'>68.78</td><td style='text-decoration: underline;'>63.25</td></tr>
<tr><td>en</td><td style=''>80.16</td><td style='text-decoration: underline;'>80.27</td><td style='font-weight: bold;'>83.52</td><td style=''>79.81</td></tr>
<tr><td>es</td><td style=''>64.90</td><td style=''>66.97</td><td style='font-weight: bold;'>72.93</td><td style='text-decoration: underline;'>68.14</td></tr>
</table>

##### ⚫ TyDiQA
Metric used: F1.

<table>
<tr><th>langs</th><th>mBERT (178M)</th><th>xlm-roberta-base (279M)</th><th>xlm-roberta-large (561M)</th><th>mRoBERTa (283M)</th></tr>
<tr><td>en</td><td style='text-decoration: underline;'>68.26</td><td style=''>59.57</td><td style='font-weight: bold;'>71.33</td><td style=''>61.50</td></tr>
<tr><td>fi</td><td style='text-decoration: underline;'>55.70</td><td style=''>51.91</td><td style='font-weight: bold;'>70.62</td><td style=''>52.32</td></tr>
<tr><td>ru</td><td style='text-decoration: underline;'>53.71</td><td style=''>50.75</td><td style='font-weight: bold;'>64.48</td><td style=''>50.66</td></tr>
</table>



### CLUB Benchmark

The [Catalan Language Understanding Benchmark](https://club.aina.bsc.es/datasets.html) consists of 6 tasks: Named Entity Recognition (NER), Part-of-Speech Tagging (POS), Semantic Textual Similarity (STS), Text Classification (TC), Textual Entailment (TE), and Question Answering (QA). This benchmark evaluates the model's capabilities in the Catalan language.

This comparison also includes RoBERTa-ca, a model derived from mRoBERTa by applying vocabulary adaptation and performing continual pre-training on a 95GB Catalan-only corpus. For further details, visit [here](https://huggingface.co/BSC-LT/RoBERTa-ca).

<table>
<tr><th>tasks</th><th style=''>roberta-base-bne (125M)</th><th style=''>berta (126M)</th><th style=''>mBERT (178M)</th><th style=''>xlm-roberta-base (279M)</th><th style=''>xlm-roberta-large (561M)</th><th style=''>roberta-ca (125M)</th><th style=''>mRoBERTa (283M)</th></tr>
<tr><td>ner (F1)</td><td style=''>87.59</td><td style='text-decoration: underline;'>89.47</td><td style=''>85.89</td><td style=''>87.50</td><td style='text-decoration: underline;'>89.47</td><td style='font-weight: bold;'>89.70</td><td style=''>88.33</td></tr>
<tr><td>pos (F1)</td><td style=''>98.64</td><td style=''>98.89</td><td style=''>98.78</td><td style=''>98.91</td><td style='font-weight: bold;'>99.03</td><td style='text-decoration: underline;'>99.00</td><td style=''>98.98</td></tr>
<tr><td>sts (Pearson)</td><td style=''>74.27</td><td style=''>81.39</td><td style=''>77.05</td><td style=''>75.11</td><td style='font-weight: bold;'>83.49</td><td style='text-decoration: underline;'>82.99</td><td style=''>79.52</td></tr>
<tr><td>tc (Acc.)</td><td style='text-decoration: underline;'>73.86</td><td style=''>73.16</td><td style=''>72.00</td><td style=''>73.05</td><td style='font-weight: bold;'>74.10</td><td style=''>72.81</td><td style=''>72.41</td></tr>
<tr><td>te (Acc.)</td><td style=''>72.27</td><td style=''>80.11</td><td style=''>75.86</td><td style=''>78.27</td><td style='font-weight: bold;'>86.63</td><td style=''>82.14</td><td style='text-decoration: underline;'>82.38</td></tr>
<tr><td>viquiquad (F1)</td><td style=''>82.56</td><td style=''>86.74</td><td style=''>87.42</td><td style=''>86.81</td><td style='font-weight: bold;'>90.35</td><td style=''>87.31</td><td style='text-decoration: underline;'>87.86</td></tr>
<tr><td>xquad (F1)</td><td style=''>60.56</td><td style=''>67.38</td><td style=''>67.72</td><td style=''>68.56</td><td style='font-weight: bold;'>76.08</td><td style='text-decoration: underline;'>70.53</td><td style=''>69.40</td></tr>
</table>

### Galician Benchmark

To evaluate performance in Galician, the models are tested on two tasks highlighted in [Bertinho's paper](https://arxiv.org/pdf/2103.13799); both are token-classification tasks (see the sketch after the results table):
 - _NER task_: named entity recognition using the [SLI NERC](https://github.com/xavier-gz/SLI_Galician_Corpora/blob/main/SLI_NERC_Galician_Gold_CoNLL.1.0.tar.gz) dataset.
 - _POS task_: part-of-speech tagging using the [Universal Dependencies](https://huggingface.co/datasets/universal-dependencies/universal_dependencies) dataset.


<table>
<tr><th>tasks</th><th style=''>bertinho (109M)</th><th style=''>roberta-base-bne (125M)</th><th style=''>mBERT (178M)</th><th style=''>xlm-roberta-base (279M)</th><th style=''>xlm-roberta-large (561M)</th><th style=''>mRoBERTa (283M)</th></tr>
<tr><td>ner-dataset-SLI NERC (F1)</td><td style=''>86.27</td><td style=''>86.80</td><td style=''>86.22</td><td style=''>85.99</td><td style='font-weight: bold;'>88.10</td><td style='text-decoration: underline;'>87.75</td></tr>
<tr><td>pos-dataset-UD_GL_CTG (F1)</td><td style=''>97.58</td><td style=''>97.27</td><td style=''>97.57</td><td style='text-decoration: underline;'>97.77</td><td style='font-weight: bold;'>97.95</td><td style=''>97.75</td></tr>
</table>
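
Both tasks reduce to token classification, so the same head on top of the encoder serves NER and POS alike. A sketch (the label count depends on the tag set of the chosen corpus and is assumed here):

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BSC-LT/mRoBERTa")
model = AutoModelForTokenClassification.from_pretrained(
    "BSC-LT/mRoBERTa", num_labels=9  # e.g. BIO tags over four entity types
)

# Pre-tokenized Galician input; the tokenizer aligns subwords to words.
tokens = ["A", "Coruña", "é", "unha", "cidade", "galega", "."]
inputs = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")
logits = model(**inputs).logits  # one score vector per subword position
```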




### BasqueGLUE Benchmark

To assess model performance in Basque, the [BasqueGLUE](https://huggingface.co/datasets/orai-nlp/basqueGLUE) benchmark is used. BasqueGLUE was built from previously existing datasets, following criteria similar to those used for the construction of GLUE and SuperGLUE. Some of the tasks have been slightly adapted so that all models can be assessed in the same way (e.g., FMTODeu_slot is originally described as a "slot filling" task, but it is evaluated as NERC, since it follows the BIO annotation scheme; see the scoring sketch after the table).

<table>
<tr><th>tasks</th><th style=''>roberta-base-bne (125M)</th><th style=''>mBERT (178M)</th><th style=''>xlm-roberta-base (279M)</th><th style=''>xlm-roberta-large (561M)</th><th style=''>mRoBERTa (283M)</th></tr>
<tr><td>NERC - NERCid (F1)</td><td style=''>71.53</td><td style=''>79.98</td><td style='text-decoration: underline;'>81.74</td><td style='font-weight: bold;'>83.96</td><td style=''>80.86</td></tr>
<tr><td>NERC - NERCood (F1)</td><td style=''>61.47</td><td style=''>76.95</td><td style=''>76.23</td><td style='font-weight: bold;'>80.25</td><td style='text-decoration: underline;'>76.97</td></tr>
<tr><td>NERC - FMTODeu_slot (F1)</td><td style=''>72.70</td><td style=''>73.65</td><td style=''>73.80</td><td style='text-decoration: underline;'>77.09</td><td style='font-weight: bold;'>77.32</td></tr>
<tr><td>Sentiment Analysis - BEC2016eu (Acc.)</td><td style=''>67.13</td><td style=''>67.05</td><td style='font-weight: bold;'>69.89</td><td style=''>67.90</td><td style='text-decoration: underline;'>69.20</td></tr>
<tr><td>Topic Classification - BHTCv2 (Acc.)</td><td style=''>66.72</td><td style=''>70.17</td><td style=''>72.01</td><td style='font-weight: bold;'>75.78</td><td style='text-decoration: underline;'>72.55</td></tr>
<tr><td>Intent Classification - FMTODeu_intent (Acc.)</td><td style=''>78.38</td><td style=''>78.01</td><td style=''>82.15</td><td style='font-weight: bold;'>83.35</td><td style='text-decoration: underline;'>83.07</td></tr>
<tr><td>Stance Detection - VaxxStance (Acc.)</td><td style=''>58.01</td><td style='font-weight: bold;'>66.67</td><td style=''>61.22</td><td style='text-decoration: underline;'>66.03</td><td style=''>65.71</td></tr>
</table>
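
Because the slot-filling outputs follow the BIO scheme, they can be scored with the same span-level F1 used for NERC. A sketch with `seqeval` and made-up tag names (the real FMTODeu_slot labels are not reproduced here):

```python
from seqeval.metrics import f1_score

# Hypothetical BIO-tagged sentences: gold annotations vs. model predictions.
gold = [["O", "B-datetime", "I-datetime", "O", "B-reminder", "O"]]
pred = [["O", "B-datetime", "I-datetime", "O", "O", "O"]]

# Span-level F1: one of the two gold spans is recovered exactly,
# so precision = 1.0, recall = 0.5, F1 ≈ 0.67.
print(f1_score(gold, pred))
```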



## Additional information

### Author
The Language Technologies Lab at the Barcelona Supercomputing Center.

### Contact
For further information, please send an email to <[email protected]>.

### Copyright
Copyright(c) 2025 by Language Technologies Lab, Barcelona Supercomputing Center.


### Funding

This work has been promoted and financed by the Ministerio para la Transformación Digital y de la Función Pública and Plan de Recuperación, Transformación y Resiliencia - Funded by EU – NextGenerationEU within the framework of the project Modelos del Lenguaje.

### Acknowledgements

This project has benefited from the data contributions of numerous teams and institutions.

In Catalonia, many institutions have been involved in the project. Our thanks to Òmnium Cultural, Parlament de Catalunya, Institut d'Estudis Aranesos, Racó Català, Vilaweb, ACN, Nació Digital, El món and Aquí Berguedà.

At the national level, we are especially grateful to our ILENIA project partners: CENID, HiTZ and CiTIUS for their participation. We also extend our genuine gratitude to the Spanish Senate and Congress, Fundación Dialnet, Fundación Elcano and the 'Instituto Universitario de Sistemas Inteligentes y Aplicaciones Numéricas en Ingeniería (SIANI)' of the University of Las Palmas de Gran Canaria.

At the international level, we thank the Welsh government, DFKI, Occiglot project, especially Malte Ostendorff, and The Common Crawl Foundation, especially Pedro Ortiz, for their collaboration.

Their valuable efforts have been instrumental in the development of this work.


### Disclaimer
Be aware that the model may contain biases or other unintended distortions. 
When third parties deploy systems or provide services based on this model, or use the model themselves, 
they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, 
including those governing the use of Artificial Intelligence.

The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.

### License
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)