Solve-Detect-Verify: Inference-Time Scaling with Flexible Generative Verifier
Abstract
FlexiVe, a novel generative verifier, optimally balances computational resources for enhanced LLM reasoning, improving accuracy and efficiency on complex tasks.
Large Language Model (LLM) reasoning for complex tasks inherently involves a trade-off between solution accuracy and computational efficiency. The subsequent step of verification, while intended to improve performance, further complicates this landscape by introducing its own challenging trade-off: sophisticated Generative Reward Models (GenRMs) can be computationally prohibitive if naively integrated with LLMs at test-time, while simpler, faster methods may lack reliability. To overcome these challenges, we introduce FlexiVe, a novel generative verifier that flexibly balances computational resources between rapid, reliable fast thinking and meticulous slow thinking using a Flexible Allocation of Verification Budget strategy. We further propose the Solve-Detect-Verify pipeline, an efficient inference-time scaling framework that intelligently integrates FlexiVe, proactively identifying solution completion points to trigger targeted verification and provide focused solver feedback. Experiments show FlexiVe achieves superior accuracy in pinpointing errors within reasoning traces on ProcessBench. Furthermore, on challenging mathematical reasoning benchmarks (AIME 2024, AIME 2025, and CNMO), our full approach outperforms baselines like self-consistency in reasoning accuracy and inference efficiency. Our system offers a scalable and effective solution to enhance LLM reasoning at test time.
Community
This paper introduces FlexiVe, a novel generative verifier, and the Solve-Detect-Verify pipeline, which together address the trade-off between accuracy and computational efficiency in Large Language Model (LLM) reasoning.
FlexiVe dynamically balances "fast thinking" (rapid, resource-efficient error diagnosis) and "slow thinking" (meticulous, computationally intensive analysis) using a Flexible Allocation of Verification Budget strategy. This strategy first uses efficient, parallel assessments to gauge verification difficulty before escalating to deeper analysis if needed. FlexiVe is trained with Group Relative Policy Optimization (GRPO) for mistake detection.
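A minimal Python sketch of this budget-allocation idea follows, assuming hypothetical `fast_verify`/`slow_verify` callables and an illustrative agreement threshold (these names and defaults are not from the paper):

```python
# Sketch of a Flexible Allocation of Verification Budget strategy:
# cheap fast-thinking checks first, escalation only on disagreement.
from collections import Counter
from typing import Callable

Verifier = Callable[[str, str], bool]  # (problem, solution) -> verdict

def flexible_verify(problem: str, solution: str,
                    fast_verify: Verifier, slow_verify: Verifier,
                    n_fast: int = 4, agreement: float = 0.75) -> bool:
    """Run several cheap fast-thinking checks; escalate to one expensive
    slow-thinking pass only when the fast verdicts disagree."""
    verdicts = [fast_verify(problem, solution) for _ in range(n_fast)]
    majority, count = Counter(verdicts).most_common(1)[0]
    if count / n_fast >= agreement:
        return majority                    # easy case: fast verdicts agree
    return slow_verify(problem, solution)  # hard case: spend more compute

# Usage with trivial stand-in verifiers:
ok = flexible_verify("1 + 1 = ?", "2",
                     fast_verify=lambda p, s: s.strip() == "2",
                     slow_verify=lambda p, s: s.strip() == "2")
```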
The Solve-Detect-Verify pipeline integrates FlexiVe into an efficient inference-time scaling framework. It consists of three stages (a minimal sketch follows the list):
- Solve: An LLM generates an initial solution.
- Detect: A lightweight mechanism monitors the LLM's output for hesitation keywords and uses log-probabilities to assess whether a solution is complete, potentially pausing generation early.
- Verify and Refine: FlexiVe assesses the candidate solution. If it is correct, it is finalized. If errors are found, FlexiVe's feedback guides the solver to generate a single new, refined solution.
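Here is a minimal sketch of the Solve-Detect-Verify loop, assuming hypothetical `solver` and `verifier` callables; the keyword list and log-probability threshold are illustrative assumptions, not the paper's values:

```python
# Sketch of the Solve-Detect-Verify loop with stand-in components.
HESITATION_KEYWORDS = ("wait", "hmm", "let me double-check")  # assumed markers

def looks_complete(text: str, mean_logprob: float,
                   threshold: float = -0.5) -> bool:
    """Detect: flag a likely completion point when recent output is
    confident (high mean log-probability) and free of hesitation markers."""
    tail = text[-200:].lower()
    hesitant = any(k in tail for k in HESITATION_KEYWORDS)
    return mean_logprob >= threshold and not hesitant

def solve_detect_verify(problem, solver, verifier):
    # Solve: solver(problem, feedback) -> (solution_text, mean_logprob)
    draft, mean_lp = solver(problem, None)
    # Detect: only pay for verification at a plausible completion point.
    if not looks_complete(draft, mean_lp):
        return draft
    # Verify: verifier(problem, solution) -> (is_correct, feedback)
    verdict, feedback = verifier(problem, draft)
    if verdict:
        return draft  # verified correct: finalize early
    # Refine: a single feedback-guided retry keeps the overhead bounded.
    refined, _ = solver(problem, feedback)
    return refined
```

Capping refinement at one feedback-guided retry bounds the worst case at roughly two solver calls plus one verification, which is what makes the pipeline's overhead predictable.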
The following related papers were recommended by the Semantic Scholar API:
- Rethinking Optimal Verification Granularity for Compute-Efficient Test-Time Scaling (2025)
- Scalable Chain of Thoughts via Elastic Reasoning (2025)
- Think Deep, Think Fast: Investigating Efficiency of Verifier-free Inference-time-scaling Methods (2025)
- When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning (2025)
- Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers (2025)
- VerifiAgent: a Unified Verification Agent in Language Model Reasoning (2025)
- ReasoningV: Efficient Verilog Code Generation with Adaptive Hybrid Reasoning Model (2025)