LG | Oct 30, 2025

Limits of Generalization in RLVR: Two Case Studies in Mathematical Reasoning

arXiv:2510.27044v114.45 citations | h-index: 6 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work highlights the limits of RLVR generalization for mathematical reasoning in large language models, emphasizing the need for benchmarks that avoid shortcut exploitation.

The paper investigates Reinforcement Learning with Verifiable Rewards (RLVR) on two combinatorial mathematical reasoning problems, finding that it improves metrics but often by reinforcing superficial heuristics rather than acquiring genuine reasoning strategies.

Mathematical reasoning is a central challenge for large language models (LLMs), requiring not only correct answers but also faithful reasoning processes. Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising approach for enhancing such capabilities; however, its ability to foster genuine reasoning remains unclear. We investigate RLVR on two combinatorial problems with fully verifiable solutions: Activity Scheduling and the Longest Increasing Subsequence, using carefully curated datasets with unique optima. Across multiple reward designs, we find that RLVR improves evaluation metrics but often by reinforcing superficial heuristics rather than acquiring new reasoning strategies. These findings highlight the limits of RLVR generalization, emphasizing the importance of benchmarks that disentangle genuine mathematical reasoning from shortcut exploitation and provide faithful measures of progress. Code available at https://github.com/xashru/rlvr-seq-generalization.
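To make "fully verifiable solutions" and "unique optima" concrete, here is a minimal sketch for the Longest Increasing Subsequence task. It is not taken from the paper or its repository: the function names, the strict-increase convention, and the binary reward are all assumptions chosen for illustration. The paper's actual reward designs may differ.

```python
from bisect import bisect_left


def lis_length(seq):
    """Length of the longest strictly increasing subsequence, O(n log n)."""
    tails = []  # tails[k] = smallest tail of an increasing subsequence of length k+1
    for x in seq:
        i = bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)


def count_optimal(seq):
    """Number of index sets that achieve the LIS length, O(n^2)."""
    n = len(seq)
    if n == 0:
        return 0
    length = [1] * n  # best length of a subsequence ending at i
    count = [1] * n   # number of subsequences achieving length[i]
    for i in range(n):
        for j in range(i):
            if seq[j] < seq[i]:
                if length[j] + 1 > length[i]:
                    length[i] = length[j] + 1
                    count[i] = count[j]
                elif length[j] + 1 == length[i]:
                    count[i] += count[j]
    best = max(length)
    return sum(c for l, c in zip(length, count) if l == best)


def has_unique_optimum(seq):
    """Hypothetical dataset filter: keep only instances with exactly one optimal answer."""
    return count_optimal(seq) == 1


def exact_reward(seq, answer_indices):
    """Hypothetical binary reward: 1.0 only for a valid, optimal index list."""
    idx = list(answer_indices)
    if not all(0 <= i < len(seq) for i in idx):
        return 0.0
    pairs = list(zip(idx, idx[1:]))
    valid = all(a < b and seq[a] < seq[b] for a, b in pairs)
    return 1.0 if valid and len(idx) == lis_length(seq) else 0.0
```

With a filter like `has_unique_optimum`, an instance such as `[5, 1, 2, 3, 0]` is kept (only indices `[1, 2, 3]` are optimal), while `[3, 1, 2, 5, 4, 6]` is rejected because two different subsequences reach the optimal length. A reward that checks only the final answer, as `exact_reward` does, cannot tell whether that answer came from a sound procedure or a shortcut, which is the kind of gap the abstract describes.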

