Update README.md
README.md
@@ -303,7 +303,7 @@ for output in outputs:
 
 The pre-training corpus comprises data from 35 European languages and 92 programming languages, with detailed data sources provided below.
 The initial 1.6 training epochs used 2.4 trillion tokens, obtained by manually adjusting data proportion to balance the representation
-and give more importance to Spain’s co-official (Spanish, Catalan, Galician, and Basque). This way, we downsampled code and English data to half,
+and give more importance to Spain’s co-official languages (Spanish, Catalan, Galician, and Basque). This way, we downsampled code and English data to half,
 Spanish co-official languages were oversampled by 2x, and the remaining languages were kept in their original proportions.
 During the following training, the Colossal OSCAR dataset was replaced with the FineWeb-Edu dataset.
 This adjustment resulted in a total of 2.68 trillion tokens used across 2 epochs, distributed as outlined below:
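As a rough illustration of the mixing rule described in this hunk, below is a minimal Python sketch; it is not part of the repository, and the source names and token counts are made-up placeholders. It applies the factors stated above (code and English downsampled to half, Spain's co-official languages oversampled by 2x, everything else unchanged) to hypothetical per-source token counts and prints the resulting proportions.

```python
# Hypothetical sketch of the data-mixing rule described above.
# The base token counts are illustrative placeholders, not real corpus figures.
base_tokens = {
    "code": 600e9,
    "english": 900e9,
    "spanish": 200e9,
    "catalan": 30e9,
    "galician": 10e9,
    "basque": 10e9,
    "german": 150e9,   # stands in for the remaining European languages
}

CO_OFFICIAL = {"spanish", "catalan", "galician", "basque"}

def mixing_factor(source: str) -> float:
    """Return the sampling factor for a data source."""
    if source in {"code", "english"}:
        return 0.5   # downsampled to half
    if source in CO_OFFICIAL:
        return 2.0   # oversampled by 2x
    return 1.0       # kept in its original proportion

mixed = {s: n * mixing_factor(s) for s, n in base_tokens.items()}
total = sum(mixed.values())

for source, tokens in mixed.items():
    print(f"{source:>8}: {tokens / 1e9:7.1f}B tokens ({tokens / total:.1%})")
```

The exact mechanism the authors used to realize these proportions (repetition counts, sampling weights in the data loader, etc.) is not specified in this hunk; the sketch only reproduces the stated 0.5x / 2x / 1x adjustment.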