nielsr HF Staff committed on
Commit 20544a7 · verified · 1 Parent(s): 0ca31bd

Improve model card: Add metadata, tags, and clarify model purpose


This PR improves the model card by:

* Adding metadata: `library_name` (fasttext), license (MIT), and relevant tags.
* Clarifying the model's purpose as a data filter for pretraining LLMs, not a language model itself.
* Adding a link to the GitHub repository.
* Improving the overall description to be more informative and concise.

Files changed (1): README.md (+25, -3)
README.md CHANGED
@@ -1,6 +1,28 @@
 ---
 license: mit
+library_name: fasttext
+tags:
+- pretraining-data
+- data-selection
+- data-filter
 ---
-This is the fastText pretraining data filter targeting
-the LAMBADA ES task, discussed in the main text of the Perplexity
-Correlations paper: https://arxiv.org/abs/2409.05816
+
+This is the fastText pretraining data filter targeting the LAMBADA ES task, discussed in the Perplexity Correlations paper: [Improving Pretraining Data Using Perplexity Correlations](https://arxiv.org/abs/2409.05816). This filter uses perplexity correlations to identify high-quality pretraining data *without* requiring any LLM training.
+
+This model is a data filter, not a language model itself, and should be used to select high-quality data for training LLMs. The filter works by estimating optimal weights for pretraining data selection based on the correlation between LLM perplexity on the data and downstream benchmark performance. It then uses these weights to train a fastText classifier, which can be used to filter new text data and select the highest-quality samples.
+
+Code: https://github.com/TristanThrush/perplexity-correlations
+
+## Citation
+
+```bibtex
+@misc{thrush2024perplexitycorrelations,
+  title={Improving Pretraining Data Using Perplexity Correlations},
+  author={Tristan Thrush and Christopher Potts and Tatsunori Hashimoto},
+  year={2024},
+  eprint={2409.05816},
+  archivePrefix={cs.CL},
+  url={https://arxiv.org/abs/2409.05816},
+}
+```