|
This is a **RoBERTa-base** model trained from scratch in Spanish.

The training dataset is mc4 (1), subsampled to a total of about 50 million documents. Sampling is biased towards average perplexity values (using a Gaussian function), so documents with very high perplexity (poor quality) or very low perplexity (short, repetitive texts) are discarded more often.
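The perplexity-biased subsampling described above can be sketched as follows. This is a minimal illustration, not the actual preprocessing pipeline: the function names are hypothetical, and the Gaussian centre and width default to the sample mean and standard deviation because the README does not specify the exact parameters used.

```python
import math
import random

def gaussian_weights(perplexities, mean=None, std=None):
    """Weight each document by a Gaussian centred on the average perplexity.

    Documents with very high perplexity (poor quality) or very low
    perplexity (short, repetitive text) get small weights, so they are
    discarded more often. Defaults are illustrative assumptions.
    """
    if mean is None:
        mean = sum(perplexities) / len(perplexities)
    if std is None:
        var = sum((p - mean) ** 2 for p in perplexities) / len(perplexities)
        std = math.sqrt(var) or 1.0
    return [math.exp(-((p - mean) ** 2) / (2 * std ** 2)) for p in perplexities]

def subsample(docs, perplexities, k, seed=0):
    """Draw k documents with probability proportional to the Gaussian weight."""
    rng = random.Random(seed)
    return rng.choices(docs, weights=gaussian_weights(perplexities), k=k)
```

With this weighting, a document whose perplexity sits near the corpus average is far more likely to survive the subsampling than one at either extreme.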
This model takes the one trained with sequence length 128 (2) and trains for a further 25,000 steps using sequence length 512.