arxiv:2505.11966

Solve-Detect-Verify: Inference-Time Scaling with Flexible Generative Verifier

Published on May 17
· Submitted by Jianyuan1 on May 21

Abstract

FlexiVe, a novel generative verifier, optimally balances computational resources for enhanced LLM reasoning, improving accuracy and efficiency on complex tasks.

AI-generated summary

Large Language Model (LLM) reasoning for complex tasks inherently involves a trade-off between solution accuracy and computational efficiency. The subsequent step of verification, while intended to improve performance, further complicates this landscape by introducing its own challenging trade-off: sophisticated Generative Reward Models (GenRMs) can be computationally prohibitive if naively integrated with LLMs at test-time, while simpler, faster methods may lack reliability. To overcome these challenges, we introduce FlexiVe, a novel generative verifier that flexibly balances computational resources between rapid, reliable fast thinking and meticulous slow thinking using a Flexible Allocation of Verification Budget strategy. We further propose the Solve-Detect-Verify pipeline, an efficient inference-time scaling framework that intelligently integrates FlexiVe, proactively identifying solution completion points to trigger targeted verification and provide focused solver feedback. Experiments show FlexiVe achieves superior accuracy in pinpointing errors within reasoning traces on ProcessBench. Furthermore, on challenging mathematical reasoning benchmarks (AIME 2024, AIME 2025, and CNMO), our full approach outperforms baselines like self-consistency in reasoning accuracy and inference efficiency. Our system offers a scalable and effective solution to enhance LLM reasoning at test time.

Community

Paper author and submitter:

This paper introduces FlexiVe, a novel generative verifier, and the Solve-Detect-Verify pipeline to address the trade-off between accuracy and computational efficiency in Large Language Model (LLM) reasoning.

FlexiVe dynamically balances "fast thinking" (rapid, resource-efficient error diagnosis) and "slow thinking" (meticulous, computationally intensive analysis) using a Flexible Allocation of Verification Budget strategy. This strategy first uses efficient, parallel assessments to gauge verification difficulty before escalating to deeper analysis if needed. FlexiVe is trained using Group Relative Policy Optimization (GRPO) for mistake detection.
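The budget-allocation idea described above can be sketched as a simple escalation rule: run several cheap fast-thinking passes, and invoke the expensive slow-thinking pass only when they fail to agree. This is a minimal illustrative sketch, not the paper's implementation; the function names, the consensus threshold, and the callable interfaces are all assumptions.

```python
from collections import Counter

def flexible_verify(solution, fast_verify, slow_verify,
                    n_fast=5, consensus=0.8):
    """Illustrative sketch of a Flexible Allocation of Verification Budget.

    fast_verify / slow_verify are hypothetical callables standing in for
    the verifier's fast-thinking and slow-thinking modes; thresholds are
    arbitrary, not taken from the paper.
    """
    # Spend the cheap budget first: several independent fast assessments.
    verdicts = [fast_verify(solution) for _ in range(n_fast)]
    verdict, count = Counter(verdicts).most_common(1)[0]
    if count / n_fast >= consensus:
        return verdict            # confident fast consensus: stop here
    return slow_verify(solution)  # ambiguous case: escalate to slow thinking
```

The design point is that the slow pass is gated on disagreement among the fast passes, so easy verifications never pay the full cost.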

The Solve-Detect-Verify pipeline integrates FlexiVe into an efficient inference-time scaling framework. It consists of three stages:

  • Solve: An LLM generates an initial solution.
  • Detect: A lightweight mechanism monitors the LLM's output for hesitation keywords and uses log-probabilities to assess whether a solution is complete, potentially pausing generation early.
  • Verify and Refine: FlexiVe assesses the candidate solution. If correct, it is finalized. If errors are found, FlexiVe's feedback guides the solver to generate a single new, refined solution.
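The three stages above can be sketched as a single control loop. All callables (`solver`, `detector`, `verifier`, `refine`) are hypothetical interfaces invented for illustration, not the paper's API; the completion sentinel in the usage below is likewise made up.

```python
def solve_detect_verify(problem, solver, detector, verifier, refine):
    """Illustrative sketch of the Solve-Detect-Verify control flow.

    solver(problem)  -> iterable of text chunks (a streaming LLM, assumed)
    detector(text)   -> True when the partial solution looks complete
    verifier(text)   -> (is_correct, feedback) pair
    refine(...)      -> one refined solution guided by verifier feedback
    """
    # Solve: stream an initial solution from the solver LLM.
    solution = ""
    for chunk in solver(problem):
        solution += chunk
        # Detect: watch for hesitation keywords / completion signals
        # (e.g. from log-probabilities) and pause generation early.
        if detector(solution):
            break
    # Verify and Refine: one verification, at most one refinement pass.
    ok, feedback = verifier(solution)
    if ok:
        return solution
    return refine(problem, solution, feedback)  # single refined solution
```

Note the pipeline produces at most one refinement, matching the "single new, refined solution" described above rather than an open-ended revise loop.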

