---
license: mit
library_name: fasttext
tags:
  - pretraining-data
  - data-selection
  - data-filter
---

This is the fastText pretraining data filter targeting the LAMBADA ES task, from the paper *Improving Pretraining Data Using Perplexity Correlations*. It uses perplexity correlations to identify high-quality pretraining data without requiring any LLM training.

This model is a data filter, not a language model, and should be used to select high-quality data for training LLMs. The filter works by estimating optimal weights for pretraining data selection from the correlation between LLM perplexity on the data and downstream benchmark performance. These weights are then used to train a fastText classifier, which scores new text data so that the highest-quality samples can be selected.

Code: https://github.com/TristanThrush/perplexity-correlations

## Citation

```bibtex
@misc{thrush2024perplexitycorrelations,
      title={Improving Pretraining Data Using Perplexity Correlations},
      author={Tristan Thrush and Christopher Potts and Tatsunori Hashimoto},
      year={2024},
      eprint={2409.05816},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.05816},
}
```