arxiv:2504.02495

Inference-Time Scaling for Generalist Reward Modeling

Published on Apr 3
· Submitted by BestWishYsh on Apr 4
Abstract

Reinforcement learning (RL) has been widely adopted in post-training for large language models (LLMs) at scale. Recently, the incentivization of reasoning capabilities in LLMs through RL indicates that proper learning methods could enable effective inference-time scalability. A key challenge of RL is to obtain accurate reward signals for LLMs in various domains beyond verifiable questions or artificial rules. In this work, we investigate how to improve reward modeling (RM) with more inference compute for general queries, i.e., the inference-time scalability of generalist RM, and further, how to improve the effectiveness of performance-compute scaling with proper learning methods. For the RM approach, we adopt pointwise generative reward modeling (GRM) to enable flexibility for different input types and potential for inference-time scaling. For the learning method, we propose Self-Principled Critique Tuning (SPCT) to foster scalable reward-generation behaviors in GRMs through online RL, so that they generate principles adaptively and critiques accurately, resulting in the DeepSeek-GRM models. Furthermore, for effective inference-time scaling, we use parallel sampling to expand compute usage, and introduce a meta RM to guide the voting process for better scaling performance. Empirically, we show that SPCT significantly improves the quality and scalability of GRMs, outperforming existing methods and models in various RM benchmarks without severe biases, and could achieve better performance than training-time scaling. DeepSeek-GRM still faces challenges in some tasks, which we believe can be addressed by future efforts in generalist reward systems. The models will be released and open-sourced.
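
As a concrete illustration of the scaling recipe described above, here is a minimal sketch of parallel sampling with meta-RM-guided voting. The helpers grm_generate and meta_rm_score are hypothetical stand-ins for the paper's GRM and meta RM calls, and the 1-10 scoring scale and top-k filtering are assumptions for illustration, not the authors' exact procedure.

```python
# Minimal sketch of inference-time scaling for a generative reward model (GRM):
# sample several principle+critique generations in parallel, let a meta RM score
# each generation, and vote only over the highest-quality samples.
# `grm_generate` and `meta_rm_score` are hypothetical stand-ins, not the paper's API.

from collections import defaultdict
from typing import Callable, Dict, List


def scaled_reward(
    query: str,
    responses: List[str],
    grm_generate: Callable[[str, List[str]], Dict[int, int]],   # per-response scores, e.g. 1-10
    meta_rm_score: Callable[[str, List[str], Dict[int, int]], float],
    num_samples: int = 8,   # parallel samples: more compute, better scaling
    top_k: int = 4,         # keep only the meta RM's top-k generations
) -> Dict[int, int]:
    """Return voted per-response scores aggregated over sampled GRM generations."""
    samples = []
    for _ in range(num_samples):
        scores = grm_generate(query, responses)            # one sampled set of principles/critiques/scores
        quality = meta_rm_score(query, responses, scores)  # meta RM judges this generation
        samples.append((quality, scores))

    # Guided voting: discard low-quality generations before summing scores.
    samples.sort(key=lambda s: s[0], reverse=True)
    votes: Dict[int, int] = defaultdict(int)
    for _, scores in samples[:top_k]:
        for idx, score in scores.items():
            votes[idx] += score
    return dict(votes)
```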

Community


Awesome!

Do check out the NotebookLM AI-generated podcast of this paper
(prompted to focus on an AI/ML audience and strictly instructed not to make jokes):

https://youtu.be/AUPkMDlQ8ZM?si=nEkvapG6xeVyVKEn

Great work! Is there a timeline for open-sourcing the code?

I would like to open-source the code, but the main workhorse is NotebookLM, which converts a document into a podcast. The rest of the code I've developed creates the video overlay using free videos from Pexels. The main script is an ffmpeg command that does most of the work; a lot of the remaining code manages files, converts the Pexels videos into a standard format, removes background noise, finds the length of each file to match the audio input, and so on.
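
Roughly, the core of the pipeline looks something like this (a simplified Python sketch with placeholder filenames, not the exact script):

```python
# Rough sketch: probe the NotebookLM audio length with ffprobe, then use ffmpeg
# to loop a stock (e.g. Pexels) clip to that length and mux in the podcast audio.
# Filenames are placeholders.

import subprocess


def audio_duration_seconds(path: str) -> float:
    """Return the duration of an audio file using ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip())


def make_video(audio_path: str, background_path: str, output_path: str) -> None:
    duration = audio_duration_seconds(audio_path)
    subprocess.run(
        ["ffmpeg", "-y",
         "-stream_loop", "-1", "-i", background_path,  # loop the background clip
         "-i", audio_path,
         "-t", str(duration),                          # cut the output to the audio length
         "-map", "0:v", "-map", "1:a",
         "-c:v", "libx264", "-c:a", "aac",
         output_path],
        check=True,
    )


# Example: make_video("podcast.mp3", "pexels_clip.mp4", "episode.mp4")
```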

The code will be open-sourced when I can find the time to clean up the codebase. Until then, I'm focused on creating podcasts that I'd love to listen to. I'm currently creating podcasts on interesting codebases - check out the one on Bleve, a Go search index library: https://youtu.be/Fq60EK6c_H0


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2504.02495 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2504.02495 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2504.02495 in a Space README.md to link it from this page.

Collections including this paper 12