arxiv:2402.07625

AutoMathText: Autonomous Data Selection with Language Models for Mathematical Texts

Published on Feb 12, 2024 · Submitted by akhaliq on Feb 13, 2024
Abstract

AI-generated summary: A novel strategy using meta-prompted language models autonomously selects high-quality mathematical data for continual pretraining, significantly improving a language model's mathematical reasoning and reducing token usage.

To improve language models' proficiency in mathematical reasoning via continual pretraining, we introduce a novel strategy that leverages base language models for autonomous data selection. Departing from conventional supervised fine-tuning or trained classifiers with human-annotated data, our approach uses meta-prompted language models as zero-shot verifiers to autonomously evaluate and select high-quality mathematical content. We release the curated open-source AutoMathText dataset, which encompasses over 200GB of data. To demonstrate the efficacy of our method, we continually pretrained a 7B-parameter Mistral language model on the AutoMathText dataset, achieving substantial improvements in downstream performance on the MATH dataset while using orders of magnitude fewer tokens than previous continual pretraining works. Our method shows a twofold increase in pretraining token efficiency compared to baselines, underscoring the potential of our approach for enhancing models' mathematical reasoning capabilities. The AutoMathText dataset is available at https://huggingface.co/datasets/math-ai/AutoMathText. The code is available at https://github.com/yifanzhang-pro/AutoMathText.
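The zero-shot verification step described in the abstract can be illustrated with a minimal sketch. It assumes the verifier is asked a yes/no question about each document and that the base model's logits for a "YES" and a "NO" answer are already available; the score is then a two-way softmax over those logits, and documents scoring above a threshold are kept. The function names, the threshold value, and the example logits are hypothetical and are not taken from the released code.

```python
import math
from typing import Iterable, List, Tuple


def yes_no_score(logit_yes: float, logit_no: float) -> float:
    """Two-way softmax: share of probability mass on YES versus NO."""
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def select_documents(
    scored_docs: Iterable[Tuple[str, float, float]],
    threshold: float = 0.6,  # hypothetical cutoff, chosen only for illustration
) -> List[str]:
    """Keep documents whose YES score meets the threshold.

    Each item is (text, logit_yes, logit_no); in practice the logits would
    come from a base language model given a meta-prompt, which is assumed
    to happen elsewhere.
    """
    return [
        text
        for text, logit_yes, logit_no in scored_docs
        if yes_no_score(logit_yes, logit_no) >= threshold
    ]


if __name__ == "__main__":
    # Made-up logits for two placeholder documents.
    docs = [
        ("A proof that the square root of 2 is irrational.", 2.0, 0.0),
        ("Celebrity gossip roundup.", -1.0, 1.0),
    ]
    print(select_documents(docs))
```

Because the score is continuous rather than a binary label, the threshold can be tuned to trade dataset size against quality without rescoring the corpus.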

Community

Paper author

The 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025 Findings)
