LG · Sep 27, 2025

Critique to Verify: Accurate and Honest Test-Time Scaling with RL-Trained Verifiers

Zhicheng Yang, Zhijiang Guo, Yinya Huang, Yongxin Wang, Yiwei Wang, Xiaodan Liang, Jing Tang

arXiv:2509.23152v1 · 4.1 · h-index: 26

Originality: Incremental advance

AI Analysis

For AI researchers, this work addresses a key bottleneck in LLM reasoning: it offers an incremental improvement by enhancing verifier training with critique signals.

The paper tackles test-time scaling for LLMs, where reward model selection often fails to identify minority-yet-correct answers. It introduces Mirror-Critique, which trains verifiers with informative critiques, and reports significant improvements over majority voting in solution accuracy and honesty.

Test-time scaling via solution sampling and aggregation has become a key paradigm for improving the reasoning performance of Large Language Models (LLMs). While reward model selection is commonly employed in this approach, it often fails to identify minority-yet-correct answers, which limits its gains over simple majority voting. We argue that this limitation stems from a lack of informative critique signals during verifier training. To bridge this gap, we introduce Mirror-Critique, a framework that trains a verifier with informative critiques. Our key insight is to obtain a rich critique signal by contrasting model-generated solutions with ground-truth solutions. We deploy a small instruction-tuned model to synthesize high-quality critique data with rejection sampling, teaching the verifier not only what is wrong but also why. The synthetic data is used to cold-start the LLMs in the RLVR process, which further improves their verification ability. The resulting Mirror-Verifier evaluates candidate solutions by generating multiple critiques per solution and aggregating them into a verify score used for weighted voting or selective abstention. Experimental results show that Mirror-Verifier significantly outperforms majority voting in solution accuracy and also improves the solver's honesty, that is, its ability to recognize when a problem lies beyond its capability boundaries and abstain from answering.
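To make the final aggregation step concrete, the following is a minimal Python sketch of score-weighted voting with selective abstention. It is not the paper's implementation. The abstract does not specify how critiques are aggregated or when the solver abstains, so the choices here are assumptions: each critique is reduced to a boolean verdict, the verify score is the fraction of positive verdicts, and the solver abstains when the winning answer's share of total weight falls below a threshold. The names `verify_score`, `weighted_vote`, and `abstain_below` are hypothetical.

```python
from collections import defaultdict
from typing import Optional

# One candidate = (final answer, verdicts from several sampled critiques).
Candidate = tuple[str, list[bool]]


def verify_score(verdicts: list[bool]) -> float:
    """Assumed aggregation: fraction of critiques that judge the solution correct."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


def weighted_vote(candidates: list[Candidate],
                  abstain_below: float = 0.5) -> Optional[str]:
    """Return the answer with the largest summed verify score, or None to abstain.

    Abstention rule (an assumption, not taken from the paper): abstain when the
    winning answer holds less than `abstain_below` of the total weight, or when
    no candidate receives any positive weight.
    """
    totals: dict[str, float] = defaultdict(float)
    for answer, verdicts in candidates:
        totals[answer] += verify_score(verdicts)

    total_weight = sum(totals.values())
    if total_weight == 0.0:
        return None  # every critique rejected every candidate

    best = max(totals, key=totals.get)
    confidence = totals[best] / total_weight
    return best if confidence >= abstain_below else None


# Three sampled solutions agree on "42" but are mostly rejected by their
# critiques; one solution answers "17" and is accepted by all of its critiques.
candidates = [
    ("42", [False, False, True]),
    ("42", [False, False, False]),
    ("42", [False, True, False]),
    ("17", [True, True, True]),
]
selected = weighted_vote(candidates)
```

In this toy input, plain majority voting would pick "42" (three of four samples), whereas the weighted vote favors the minority answer "17" because its critiques are uniformly positive. Raising `abstain_below` makes the solver more conservative about answering.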
