CL, AI, LG | Nov 21, 2025

Masked-and-Reordered Self-Supervision for Reinforcement Learning from Verifiable Rewards

arXiv:2511.17473v1 | 4 citations
AI Analysis

This work addresses the scalability limits of RLVR on mathematical theorem-proving and calculation tasks, where intermediate reasoning is crucial but final answers are hard to verify.

The paper tackles the problem of scaling reinforcement learning from verifiable rewards (RLVR) to mathematical reasoning tasks where only final answers are verifiable. It proposes MR-RLVR, which applies masked-and-reordered self-supervision to intermediate reasoning steps, and reports average relative gains over the original RLVR of +9.86% Pass@1, +5.27% Pass@5, and +4.00% Pass@8 on mathematical benchmarks.
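To make the reported metrics concrete, here is a minimal sketch of how a Pass@k score and a relative gain can be computed. It assumes the widely used unbiased Pass@k estimator (n samples per problem, c of them correct); the summary does not say which estimator the paper uses, and the function names and example numbers are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given n total samples of which c are correct (unbiased estimator)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def relative_gain(baseline: float, new: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Hypothetical numbers, not taken from the paper.
print(pass_at_k(n=8, c=2, k=1))      # 2 of 8 samples correct
print(relative_gain(50.0, 55.0))     # baseline 50.0 vs. new score 55.0
```

Note that a relative gain is measured against the baseline score, so a +9.86% relative gain in Pass@1 is not the same as a 9.86-point absolute increase.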

Test-time scaling has been shown to substantially improve large language models' (LLMs) mathematical reasoning. However, for a large portion of mathematical corpora, especially theorem proving, RLVR's scalability is limited: intermediate reasoning is crucial, while final answers are difficult to directly and reliably verify. Meanwhile, token-level SFT often degenerates into rote memorization rather than inducing longer chains of thought. Inspired by BERT's self-supervised tasks, we propose MR-RLVR (Masked-and-Reordered RLVR), which constructs process-level self-supervised rewards via "masked-then-fill" and "step reordering" to extract learnable signals from intermediate reasoning. Our training pipeline comprises two stages: we first perform self-supervised training on sampled mathematical calculation and proof data; we then conduct RLVR fine-tuning on mathematical calculation datasets where only outcomes are verifiable. We implement MR-RLVR on Qwen2.5-3B and DeepSeek-R1-Distill-Qwen-1.5B, and evaluate on AIME24, AIME25, AMC23, and MATH500. Under a fixed sampling and decoding budget, MR-RLVR achieves average relative gains over the original RLVR of +9.86% Pass@1, +5.27% Pass@5, and +4.00% Pass@8. These results indicate that incorporating process-aware self-supervised signals can effectively enhance RLVR's scalability and performance in settings where only outcomes are verifiable.
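The two self-supervised tasks named in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the `[MASK]` placeholder, the single-step masking, the exact-match fill reward, and the position-wise reordering reward are all assumptions chosen to keep the example short.

```python
import random
from typing import List, Tuple

MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper


def make_mask_task(steps: List[str], rng: random.Random) -> Tuple[List[str], str]:
    """Masked-then-fill: hide one reasoning step and keep it as the target."""
    if not steps:
        raise ValueError("need at least one reasoning step")
    idx = rng.randrange(len(steps))
    masked = list(steps)  # copy so the caller's list is not modified
    masked[idx] = MASK
    return masked, steps[idx]


def make_reorder_task(steps: List[str], rng: random.Random) -> Tuple[List[str], List[int]]:
    """Step reordering: shuffle the steps; order[j] is the original index
    of the step shown at position j."""
    order = list(range(len(steps)))
    rng.shuffle(order)
    return [steps[i] for i in order], order


def fill_reward(predicted: str, target: str) -> float:
    """Assumed reward: 1.0 on exact match after trimming whitespace, else 0.0."""
    return 1.0 if predicted.strip() == target.strip() else 0.0


def reorder_reward(predicted_order: List[int], true_order: List[int]) -> float:
    """Assumed reward: fraction of positions recovered; 0.0 if not a permutation."""
    if sorted(predicted_order) != sorted(true_order):
        return 0.0
    hits = sum(p == t for p, t in zip(predicted_order, true_order))
    return hits / len(true_order)


steps = ["Let x = 3.", "Then 2x = 6.", "So 2x + 1 = 7."]
rng = random.Random(0)
masked, target = make_mask_task(steps, rng)
shuffled, order = make_reorder_task(steps, rng)
print(masked, "| target:", target)
print(shuffled, "| order:", order)
```

Both rewards are computable from the reasoning trace alone, which is the point of the approach: they provide a training signal even when the final answer cannot be checked automatically.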

