---
license: mit
pipeline_tag: text-classification
library_name: fasttext
tags:
- pretraining-data
- data-selection
- data-filter
---

This is the fastText pretraining data filter targeting the LAMBADA ES task, introduced in the Perplexity Correlations paper: [Improving Pretraining Data Using Perplexity Correlations](https://arxiv.org/abs/2409.05816).

This filter uses perplexity correlations to identify high-quality pretraining data *without* requiring any LLM training. This model is a data filter, not a language model itself, and should be used to select high-quality data for training LLMs.

The filter works by estimating optimal weights for pretraining data selection based on the correlation between LLM perplexity on the data and downstream benchmark performance. It then uses these weights to train a fastText classifier, which can filter new text data and select the highest-quality samples.

Code: https://github.com/TristanThrush/perplexity-correlations

## Citation

```bibtex
@misc{thrush2024perplexitycorrelations,
      title={Improving Pretraining Data Using Perplexity Correlations},
      author={Tristan Thrush and Christopher Potts and Tatsunori Hashimoto},
      year={2024},
      eprint={2409.05816},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.05816},
}
```
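## Example usage

A minimal sketch of filtering a corpus with this classifier. The helper below only assumes a `predict` callable with the same shape as fastText's `model.predict` (a string in, a pair of label and probability lists out); the model filename (`model.bin`), the positive label name (`__label__hq`), and the score threshold are assumptions for illustration and are not confirmed by this card — check the repository above for the actual label names.

```python
def filter_texts(texts, predict, positive_label="__label__hq", threshold=0.5):
    """Keep texts whose score for `positive_label` is at least `threshold`.

    `predict` is expected to behave like fastText's model.predict:
    given a single-line string, return ([labels], [probabilities]).
    """
    kept = []
    for text in texts:
        # fastText's predict rejects strings containing newlines,
        # so collapse them to spaces first.
        labels, probs = predict(text.replace("\n", " "))
        score = dict(zip(labels, probs)).get(positive_label, 0.0)
        if score >= threshold:
            kept.append(text)
    return kept


# With the real library (requires `pip install fasttext` and the model
# file downloaded from this repo; filename and label are assumptions):
#
#   import fasttext
#   model = fasttext.load_model("model.bin")
#   selected = filter_texts(corpus, lambda t: model.predict(t, k=2))
```

Keeping the scoring logic behind a plain callable makes the selection step easy to test and to swap in a different classifier.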