Improve model card: Add metadata, tags, and clarify model purpose
This PR improves the model card by:
* Adding metadata: `library_name` (fasttext) and relevant tags (`pretraining-data`, `data-selection`, `data-filter`); the existing MIT license is kept.
* Clarifying the model's purpose as a data filter for pretraining LLMs, not a language model itself.
* Adding a link to the GitHub repository.
* Improving the overall description to be more informative and concise.
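
For reviewers, here is a minimal sketch of how a fastText filter like this one might be applied to select pretraining documents. It assumes the classifier file is already available locally. The label name `__label__include`, the helpers `make_fasttext_scorer` and `select_top_fraction`, and the `keep_fraction` parameter are illustrative assumptions, not taken from the model card or the repository; only `fasttext.load_model`, `model.predict`, and `model.get_labels` are actual fastText calls.

```python
from typing import Callable, Iterable, List

# Assumption: the classifier has a positive label with this name.
# Check the real label set with `model.get_labels()` before relying on it.
INCLUDE_LABEL = "__label__include"


def make_fasttext_scorer(
    model_path: str, include_label: str = INCLUDE_LABEL
) -> Callable[[str], float]:
    """Wrap a fastText .bin file as a function mapping text to P(include)."""
    import fasttext  # imported lazily so the selection logic runs without it

    model = fasttext.load_model(model_path)

    def score(text: str) -> float:
        # fastText's predict() rejects newlines, so flatten the document first.
        labels, probs = model.predict(text.replace("\n", " "), k=-1)
        return float(dict(zip(labels, probs)).get(include_label, 0.0))

    return score


def select_top_fraction(
    docs: Iterable[str],
    score: Callable[[str], float],
    keep_fraction: float = 0.5,
) -> List[str]:
    """Keep the highest-scoring share of documents (hypothetical policy)."""
    ranked = sorted(docs, key=score, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]
```

Usage would look like `score = make_fasttext_scorer("path/to/filter.bin")` followed by `select_top_fraction(docs, score, keep_fraction=0.25)`, where the path is a placeholder. Ranking and keeping a fixed fraction is one simple selection policy; a token budget or a probability threshold would work equally well with the same scorer.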
README.md
CHANGED
@@ -1,6 +1,28 @@
 ---
 license: mit
+library_name: fasttext
+tags:
+- pretraining-data
+- data-selection
+- data-filter
 ---
-
-the LAMBADA ES task, discussed in the
-
+
+This is the fastText pretraining data filter targeting the LAMBADA ES task, discussed in the Perplexity Correlations paper: [Improving Pretraining Data Using Perplexity Correlations](https://arxiv.org/abs/2409.05816). This filter uses perplexity correlations to identify high-quality pretraining data *without* requiring any LLM training.
+
+This model is a data filter, not a language model itself, and should be used to select high-quality data for training LLMs. The filter works by estimating optimal weights for pretraining data selection based on the correlation between LLM perplexity on the data and downstream benchmark performance. It then uses these weights to train a fastText classifier, which can be used to filter new text data and select the highest quality samples.
+
+Code: https://github.com/TristanThrush/perplexity-correlations
+
+## Citation
+
+```bibtex
+@misc{thrush2024perplexitycorrelations,
+title={Improving Pretraining Data Using Perplexity Correlations},
+author={Tristan Thrush and Christopher Potts and Tatsunori Hashimoto},
+year={2024},
+eprint={2409.05816},
+archivePrefix={arXiv},
+primaryClass={cs.CL},
+url={https://arxiv.org/abs/2409.05816},
+}
+```