Tristan nielsr HF Staff committed on
Commit 60e48f8 · verified · 1 Parent(s): 0ca31bd

Improve model card: Add metadata, tags, and clarify model purpose (#1)

- Improve model card: Add metadata, tags, and clarify model purpose (20544a721afad45109d9394e000e7b8fca4087ab)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1)
  1. README.md +25 -3
README.md CHANGED
@@ -1,6 +1,28 @@
  ---
  license: mit
+ library_name: fasttext
+ tags:
+ - pretraining-data
+ - data-selection
+ - data-filter
  ---
- This is the fastText pretraining data filter targeting
- the LAMBADA ES task, discussed in the main text of the Perplexity
- Correlations paper: https://arxiv.org/abs/2409.05816
+
+ This is the fastText pretraining data filter targeting the LAMBADA ES task, discussed in the Perplexity Correlations paper: [Improving Pretraining Data Using Perplexity Correlations](https://arxiv.org/abs/2409.05816). This filter uses perplexity correlations to identify high-quality pretraining data *without* requiring any LLM training.
+
+ This model is a data filter, not a language model itself, and should be used to select high-quality data for training LLMs. The filter works by estimating optimal weights for pretraining data selection based on the correlation between LLM perplexity on the data and downstream benchmark performance. It then uses these weights to train a fastText classifier, which can be used to filter new text data and select the highest quality samples.
+
+ Code: https://github.com/TristanThrush/perplexity-correlations
+
+ ## Citation
+
+ ```bibtex
+ @misc{thrush2024perplexitycorrelations,
+ title={Improving Pretraining Data Using Perplexity Correlations},
+ author={Tristan Thrush and Christopher Potts and Tatsunori Hashimoto},
+ year={2024},
+ eprint={2409.05816},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL},
+ url={https://arxiv.org/abs/2409.05816},
+ }
+ ```
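
The new card text says the classifier "can be used to filter new text data" but does not show how. A minimal sketch of that step follows, under stated assumptions: the label name `__label__include`, the `0.5` threshold, and the helper names `keep_score` and `filter_texts` are all hypothetical and are not taken from this commit or the linked repository. The `predict` argument stands in for anything that returns `(labels, probabilities)` the way fastText's Python `predict` does for a single string.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

# Assumption: the label the classifier assigns to text worth keeping.
# Check the actual label names of the model before relying on this.
KEEP_LABEL = "__label__include"

# A predict function in the shape fastText returns for one string:
# a sequence of labels and a parallel sequence of probabilities.
Predict = Callable[[str], Tuple[Sequence[str], Sequence[float]]]


def keep_score(predict: Predict, text: str, keep_label: str = KEEP_LABEL) -> float:
    """Return the probability assigned to keep_label, or 0.0 if it is absent."""
    # fastText's predict() rejects input containing newlines, so flatten them.
    labels, probs = predict(text.replace("\n", " "))
    for label, prob in zip(labels, probs):
        if label == keep_label:
            return float(prob)
    return 0.0


def filter_texts(
    predict: Predict,
    texts: Iterable[str],
    threshold: float = 0.5,  # illustrative cutoff, not a value from the paper
    keep_label: str = KEEP_LABEL,
) -> List[str]:
    """Keep the original texts whose keep_label probability meets the threshold."""
    return [t for t in texts if keep_score(predict, t, keep_label) >= threshold]
```

With the `fasttext` Python package, `predict` could plausibly be `lambda t: model.predict(t, k=-1)` after loading the downloaded model file with `fasttext.load_model(...)`; the model file name is not stated in this commit, so it is left out here. A fixed threshold is only one option; ranking documents by `keep_score` and keeping a top fraction is an equally reasonable reading of "select the highest quality samples".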