Abstract
This paper investigates Reinforcement Learning (RL) on data without explicit labels for reasoning tasks in Large Language Models (LLMs). The core challenge is reward estimation during inference without access to ground-truth information. While this setting may appear intractable, we find that common practices in Test-Time Scaling (TTS), such as majority voting, yield surprisingly effective rewards for driving RL training. In this work, we introduce Test-Time Reinforcement Learning (TTRL), a novel method for training LLMs with RL on unlabeled data. TTRL enables self-evolution of LLMs by exploiting the priors in pre-trained models. Our experiments demonstrate that TTRL consistently improves performance across a variety of tasks and models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by approximately 159% on AIME 2024 using only unlabeled test data. Furthermore, although TTRL is supervised only by the Maj@N metric, it consistently surpasses the upper limit of the initial model and approaches the performance of models trained directly on test data with ground-truth labels. Our experimental findings validate the general effectiveness of TTRL across various tasks and highlight its potential for broader tasks and domains. GitHub: https://github.com/PRIME-RL/TTRL
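As a minimal sketch of the core idea only (not the paper's exact implementation), the snippet below shows how a majority-vote pseudo-label over sampled rollouts could be turned into per-rollout rewards for RL training; the function name, the exact-match comparison, and the answer-extraction hook are illustrative assumptions.

```python
from collections import Counter

def majority_vote_reward(sampled_answers):
    """Hypothetical sketch: estimate rewards for N rollouts of one prompt.

    sampled_answers: final answers extracted from each rollout.
    A rollout is rewarded (1.0) iff its answer matches the majority-vote
    pseudo-label; no ground-truth label is used anywhere.
    """
    pseudo_label, _ = Counter(sampled_answers).most_common(1)[0]
    return [1.0 if a == pseudo_label else 0.0 for a in sampled_answers]

# Example: 5 rollouts, 3 agree on "42", so those three receive reward 1.
print(majority_vote_reward(["42", "41", "42", "7", "42"]))  # [1.0, 0.0, 1.0, 0.0, 1.0]
```

These binary rewards can then feed any standard policy-gradient objective; the point is that agreement with the vote replaces comparison against a ground-truth label.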
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 (2025)
- GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning (2025)
- Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning (2025)
- Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't (2025)
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? (2025)
- UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning (2025)
- How Difficulty-Aware Staged Reinforcement Learning Enhances LLMs' Reasoning Capabilities: A Preliminary Experimental Study (2025)
Unless I am mistaken, the idea of using majority vote as a reward, whether to select chosen and rejected responses (as in our ScPO paper: https://arxiv.org/abs/2411.04109) or in online RL algorithms (as in this paper), is quite similar, and it is one of the takeaways of our work.
Thank you for pointing this out. We sincerely apologize for missing your paper in our literature review. After reading it carefully, we acknowledge the similarities between our approaches. The main distinction lies in how majority voting is applied: whether it is used to construct positive and negative examples for DPO or to estimate rewards in the context of RL, as well as in the test-time scenarios considered. We will update our paper shortly to cite and discuss your work in the main content, and we plan to implement preference-optimization-style methods to further study their behavior.
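For illustration only, the hypothetical sketch below contrasts the two uses of the same majority vote discussed above: building a chosen/rejected pair for preference optimization (the ScPO-style use) versus assigning scalar rewards for RL (the TTRL-style use shown earlier). All names are assumptions and neither reflects either paper's exact implementation.

```python
from collections import Counter

def preference_pairs_from_vote(sampled_answers, responses):
    """Hypothetical sketch of the DPO-style use of a majority vote:
    pair a response agreeing with the pseudo-label (chosen) against one
    that disagrees (rejected), instead of assigning scalar RL rewards."""
    pseudo_label, _ = Counter(sampled_answers).most_common(1)[0]
    chosen = [r for a, r in zip(sampled_answers, responses) if a == pseudo_label]
    rejected = [r for a, r in zip(sampled_answers, responses) if a != pseudo_label]
    if not chosen or not rejected:
        return []  # no usable pair when all rollouts agree
    return [(chosen[0], rejected[0])]
```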