Update README.md
README.md
@@ -303,7 +303,7 @@ for output in outputs:
 
 The pre-training corpus comprises data from 35 European languages and 92 programming languages, with detailed data sources provided below.
 The initial 1.6 training epochs used 2.4 trillion tokens, obtained by manually adjusting data proportion to balance the representation
-and give more importance to Spain’s co-official (Spanish, Catalan, Galician, and Basque). This way, we downsampled code and English data to half,
+and give more importance to Spain’s co-official languages (Spanish, Catalan, Galician, and Basque). This way, we downsampled code and English data to half,
 Spanish co-official languages were oversampled by 2x, and the remaining languages were kept in their original proportions.
 During the following training, the Colossal OSCAR dataset was replaced with the FineWeb-Edu dataset.
 This adjustment resulted in a total of 2.68 trillion tokens used across 2 epochs, distributed as outlined below:
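As a rough illustration of the mixing rule described in this hunk, below is a minimal Python sketch; it is not part of the repository, and the source names and token counts are made-up placeholders. It applies the factors stated above (code and English downsampled to half, Spain's co-official languages oversampled by 2x, everything else unchanged) to hypothetical per-source token counts and prints the resulting proportions.

```python
# Hypothetical sketch of the data-mixing rule described above.
# The base token counts are illustrative placeholders, not real corpus figures.
base_tokens = {
    "code": 600e9,
    "english": 900e9,
    "spanish": 200e9,
    "catalan": 30e9,
    "galician": 10e9,
    "basque": 10e9,
    "german": 150e9,   # stands in for the remaining European languages
}

CO_OFFICIAL = {"spanish", "catalan", "galician", "basque"}

def mixing_factor(source: str) -> float:
    """Return the sampling factor for a data source."""
    if source in {"code", "english"}:
        return 0.5   # downsampled to half
    if source in CO_OFFICIAL:
        return 2.0   # oversampled by 2x
    return 1.0       # kept in its original proportion

mixed = {s: n * mixing_factor(s) for s, n in base_tokens.items()}
total = sum(mixed.values())

for source, tokens in mixed.items():
    print(f"{source:>8}: {tokens / 1e9:7.1f}B tokens ({tokens / total:.1%})")
```

The exact mechanism the authors used to realize these proportions (repetition counts, sampling weights in the data loader, etc.) is not specified in this hunk; the sketch only reproduces the stated 0.5x / 2x / 1x adjustment.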