gonzalez-agirre committed
Commit 5eab950 · verified · 1 Parent(s): aa8a4ac

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -303,7 +303,7 @@ for output in outputs:
 
 The pre-training corpus comprises data from 35 European languages and 92 programming languages, with detailed data sources provided below.
 The initial 1.6 training epochs used 2.4 trillion tokens, obtained by manually adjusting data proportion to balance the representation
-and give more importance to Spain’s co-official (Spanish, Catalan, Galician, and Basque). This way, we downsampled code and English data to half,
+and give more importance to Spain’s co-official languages (Spanish, Catalan, Galician, and Basque). This way, we downsampled code and English data to half,
 Spanish co-official languages were oversampled by 2x, and the remaining languages were kept in their original proportions.
 During the following training, the Colossal OSCAR dataset was replaced with the FineWeb-Edu dataset.
 This adjustment resulted in a total of 2.68 trillion tokens used across 2 epochs, distributed as outlined below:
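
For illustration, a minimal sketch of the reweighting scheme the changed paragraph describes: code and English downsampled to half, Spain’s co-official languages oversampled 2x, all other languages kept at their original proportions, then renormalized. The language groups and raw proportions below are hypothetical placeholders, not the model’s actual data mix.

```python
# Hypothetical raw corpus proportions (placeholders, not the real mix).
raw_proportions = {
    "english": 0.40,
    "code": 0.20,
    "spanish": 0.10,
    "catalan": 0.02,
    "galician": 0.01,
    "basque": 0.01,
    "other_european": 0.26,
}

# Multipliers as described: 0.5x for code/English, 2x for Spain's
# co-official languages; everything else defaults to 1.0.
multipliers = {
    "english": 0.5,
    "code": 0.5,
    "spanish": 2.0,
    "catalan": 2.0,
    "galician": 2.0,
    "basque": 2.0,
}

# Apply multipliers and renormalize so the sampling weights sum to 1.
adjusted = {k: v * multipliers.get(k, 1.0) for k, v in raw_proportions.items()}
total = sum(adjusted.values())
sampling_weights = {k: v / total for k, v in adjusted.items()}

for lang, w in sampling_weights.items():
    print(f"{lang}: {w:.3f}")
```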