Tags: Text Generation · Transformers · Safetensors · llama · text-generation-inference
jsaizant committed (verified)
Commit 5d9e3ac · Parent: 0e835f8

Update README.md

Files changed (1): README.md (+28 -6)
README.md CHANGED
@@ -39,6 +39,28 @@ language:
  - sr
  - sv
  - uk
+ datasets:
+ - oscar-corpus/colossal-oscar-1.0
+ - HuggingFaceFW/fineweb-edu
+ - joelniklaus/eurlex_resources
+ - joelito/legal-mc4
+ - projecte-aina/CATalog
+ - UFRGS/brwac
+ - community-datasets/hrwac
+ - danish-foundation-models/danish-gigaword
+ - HiTZ/euscrawl
+ - PleIAs/French-PD-Newspapers
+ - PleIAs/French-PD-Books
+ - AI-team-UoA/greek_legal_code
+ - HiTZ/latxa-corpus-v1.1
+ - allenai/peS2o
+ - pile-of-law/pile-of-law
+ - PORTULAN/parlamento-pt
+ - hoskinson-center/proof-pile
+ - togethercomputer/RedPajama-Data-1T
+ - bigcode/starcoderdata
+ - bjoernp/tagesschau-2018-2023
+ - EleutherAI/the_pile_deduplicated
 ---

 ![](./images/salamandra_header.png)
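The hunk above registers the pre-training corpora in the card's `datasets:` front matter so they are linked from the Hub. As a minimal sketch (not part of the commit itself), one of the newly listed corpora could be inspected with the `datasets` library; the `sample-350BT` config name is an assumption, chosen to match the 350BT FineWeb-Edu subset mentioned later in the card:

```python
# Minimal sketch: stream a few records from one of the corpora listed in the
# new `datasets:` metadata block. The "sample-350BT" config name is an
# assumption (it mirrors the 350BT subset referenced elsewhere in the card).
from datasets import load_dataset

fineweb_edu = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="sample-350BT",
    split="train",
    streaming=True,  # avoid downloading the full corpus
)

for i, record in enumerate(fineweb_edu):
    print(record["text"][:200])
    if i == 2:
        break
```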
@@ -282,13 +304,13 @@ The pre-training corpus comprises data from 35 European languages and 92 program
  The initial three training epochs used 2.4 trillion tokens, obtained by manually adjusting data proportions to balance the representation
  and give more importance to Spain’s co-official languages (Spanish, Catalan, Galician, and Basque). This way, we downsampled code and English data to half,
  Spanish co-official languages were oversampled by 2x, and the remaining languages were kept in their original proportions.
- Following, we trained two additional epochs during which the Colossal OSCAR dataset was replaced with the FineWebEdu dataset.
+ During the following epochs, the Colossal OSCAR dataset was replaced with the FineWeb-Edu dataset.
  This adjustment resulted in a total of 2.68 trillion tokens, distributed as outlined below:

- ![lang distrib](./images/corpus_languages.png)
+ ![lang distrib](./images/corpus_languages_1.1.png)

  The pretraining corpus is predominantly composed of data from Colossal OSCAR, which contributes a significant 53,05% of the total tokens.
- Following this, Starcoder provides 13,67%, and FineWebEdu (350B tokens subset) adds 10,24%. The next largest sources are HPLT at 4,21% and French-PD at 3,59%.
+ Following this, Starcoder provides 13,67%, and FineWeb-Edu (350BT subset) adds 10,24%. The next largest sources are HPLT at 4,21% and French-PD at 3,59%.
  Other notable contributions include MaCoCu, Legal-ES, and EurLex, each contributing around 1.72% to 1.41%.
  These major sources collectively form the bulk of the corpus, ensuring a rich and diverse dataset for training the language model.
  The remaining 10% comes from smaller sources in various languages.
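The rebalancing rule described in this hunk (halve English and code, double Spain's co-official languages, leave the rest unchanged) can be illustrated with a small sketch. The input proportions below are hypothetical placeholders, not the actual corpus statistics:

```python
# Illustrative sketch of the rebalancing rule: halve English and code, double
# Spain's co-official languages, keep everything else, then renormalise.
# The original_share values are hypothetical placeholders.
multipliers = {"en": 0.5, "code": 0.5, "es": 2.0, "ca": 2.0, "gl": 2.0, "eu": 2.0}

original_share = {
    "en": 0.40, "code": 0.20, "es": 0.10, "ca": 0.03,
    "gl": 0.01, "eu": 0.01, "other": 0.25,
}

adjusted = {lang: share * multipliers.get(lang, 1.0) for lang, share in original_share.items()}
total = sum(adjusted.values())
rebalanced = {lang: weight / total for lang, weight in adjusted.items()}
print(rebalanced)  # new sampling proportions, summing to 1.0
```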
@@ -430,7 +452,7 @@ To consult the data summary document with the respective licences, please send a
  </details>

  The model was trained on 3 pre-training epochs with 2.4T tokens per epoch, 2 additional pre-training epochs in which the English part
- of the Colossal OSCAR dataset was replaced with FineWebEdu (350T subset), resulting in 2.68T tokens per epoch;
+ of the Colossal OSCAR dataset was replaced with FineWeb-Edu (350BT subset), resulting in 2.68T tokens per epoch;
  and 1 final epoch of 0.315T higher quality tokens, meaning that the total number of tokens seen during pre-training is approximately 12.875 trillion tokens.

  We provide an extensive Datasheet section following the best practices defined by [(Gebru et al., 2021)](https://arxiv.org/pdf/1803.09010).
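The 12.875T figure in this hunk follows directly from the epoch schedule it describes; a quick arithmetic check:

```python
# Worked check of the total token count stated above.
epochs = [
    (3, 2.4),    # three epochs on the initial 2.4T-token mix
    (2, 2.68),   # two epochs with FineWeb-Edu replacing the English Colossal OSCAR data
    (1, 0.315),  # one final epoch of higher-quality tokens
]
total = sum(n * tokens for n, tokens in epochs)
print(round(total, 3))  # 12.875 trillion tokens
```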
@@ -463,7 +485,7 @@ and public institutions, which can be found in detail in the acknowledgements.

  **Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.**

- This work/research has been promoted and financed by the Government of Catalonia through the [Aina project](https://projecteaina.cat/).
+ This work has been promoted and financed by the Government of Catalonia through the [Aina Project](https://projecteaina.cat/).

  This work is funded by the _Ministerio para la Transformación Digital y de la Función Pública_ - Funded by EU – NextGenerationEU
  within the framework of [ILENIA Project](https://proyectoilenia.es/) with reference 2022/TL22/00215337.
@@ -1111,4 +1133,4 @@ Technical report coming soon.
  |:---:|:---:|:---:|
  |2B| [Link](https://huggingface.co/BSC-LT/salamandra-2b) | [Link](https://huggingface.co/BSC-LT/salamandra-2b-instruct) |
  |7B| [Link](https://huggingface.co/BSC-LT/salamandra-7b) | [Link](https://huggingface.co/BSC-LT/salamandra-7b-instruct) |
- |40B| [Link](https://huggingface.co/BSC-LT/ALIA-40b) | WiP |
+ |40B| [Link](https://huggingface.co/BSC-LT/ALIA-40b) | WiP |
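The table in the final hunk links the released checkpoints. As a minimal usage sketch (assuming the standard `transformers` text-generation API rather than quoting the card's own example), one of the listed instructed variants could be loaded like this:

```python
# Minimal sketch: load one of the checkpoints linked in the table above.
# The generation settings are illustrative defaults, not values from the card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BSC-LT/salamandra-7b-instruct"  # one of the links in the table

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Which languages were in the pre-training corpus?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```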
 