LG, AI, CL | May 28, 2025

From Accuracy to Robustness: A Study of Rule- and Model-based Verifiers in Mathematical Reasoning

arXiv:2505.22203v2 | 10 citations | h-index: 7 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses reliability issues in verifiers for reinforcement learning with verifiable reward (RLVR), which is crucial for training large reasoning models. Its contribution is incremental: it mainly documents limitations of existing verifiers.

The study analyzed rule-based and model-based verifiers in mathematical reasoning for RLVR, finding that rule-based verifiers have high false negative rates due to format issues, while model-based verifiers are vulnerable to hacking, leading to inflated rewards.

Trustworthy verifiers are essential for the success of reinforcement learning with verifiable reward (RLVR), which is the core methodology behind various large reasoning models such as DeepSeek-R1. In complex domains like mathematical reasoning, rule-based verifiers have been widely adopted in previous works to train strong reasoning models. However, the reliability of these verifiers and their impact on the RL training process remain poorly understood. In this work, we take mathematical reasoning as a case study and conduct a comprehensive analysis of various verifiers in both static evaluation and RL training scenarios. First, we find that current open-source rule-based verifiers often fail to recognize equivalent answers presented in different formats across multiple commonly used mathematical datasets, resulting in non-negligible false negative rates. This limitation adversely affects RL training performance and becomes more pronounced as the policy model gets stronger. Subsequently, we investigate model-based verifiers as a potential solution to address these limitations. While the static evaluation shows that model-based verifiers achieve significantly higher verification accuracy, further analysis and RL results imply that they are highly susceptible to hacking, where they misclassify certain patterns in responses as correct, particularly after fine-tuning. This vulnerability is exploited during policy model optimization, leading to artificially inflated rewards. Our findings underscore the unique challenges inherent to both rule-based and model-based verifiers and provide insights toward developing more accurate and robust reward systems for reinforcement learning.
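To make the false-negative problem concrete, here is a minimal sketch in Python. It is not the paper's code or any verifier the paper evaluates: the function names (`strict_verify`, `lenient_verify`, `false_negative_rate`) and the four answer pairs are hypothetical, chosen only to illustrate how a rule-based check can reject answers that are mathematically equivalent but formatted differently. Treating "50%" as equivalent to "1/2" is also an assumption; a real dataset may define equivalence differently.

```python
from fractions import Fraction


def strict_verify(pred: str, gold: str) -> bool:
    """Naive rule-based check: exact string match after trimming whitespace."""
    return pred.strip() == gold.strip()


def to_fraction(text: str):
    """Parse simple numeric formats ("0.5", "1/2", "50%"); return None on failure."""
    s = text.strip().replace(" ", "")
    try:
        if s.endswith("%"):
            return Fraction(s[:-1]) / 100
        return Fraction(s)
    except (ValueError, ZeroDivisionError):
        return None


def lenient_verify(pred: str, gold: str) -> bool:
    """Slightly richer rule: also accept answers that parse to the same number."""
    if strict_verify(pred, gold):
        return True
    p, g = to_fraction(pred), to_fraction(gold)
    return p is not None and g is not None and p == g


def false_negative_rate(verifier, pairs) -> float:
    """Share of truly equivalent (pred, gold) pairs that the verifier rejects."""
    rejected = sum(1 for pred, gold in pairs if not verifier(pred, gold))
    return rejected / len(pairs)


# Hypothetical pairs, all assumed equivalent; not taken from any dataset in the paper.
EQUIVALENT_PAIRS = [
    ("1/2", "1/2"),
    ("0.5", "1/2"),
    ("50%", "1/2"),
    ("\\frac{1}{2}", "1/2"),
]

if __name__ == "__main__":
    print(false_negative_rate(strict_verify, EQUIVALENT_PAIRS))
    print(false_negative_rate(lenient_verify, EQUIVALENT_PAIRS))
```

On these toy pairs the strict check rejects every reformatted answer, and the lenient check still misses the LaTeX form. Each added rule covers more formats but leaves a tail of unhandled ones, which is the gap that motivates model-based verifiers in the abstract.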
