LG · CL · Sep 30, 2025

Attention as a Compass: Efficient Exploration for Process-Supervised RL in Reasoning Models

arXiv:2509.26628v1 · 8 citations · h-index: 17
Originality: Incremental advance
AI Analysis

The paper addresses an exploration-efficiency bottleneck in PSRL for reasoning models; its contribution is an incremental improvement rather than a new paradigm.

The paper tackles the problem of limited exploration efficiency in Process-Supervised Reinforcement Learning (PSRL) for reasoning models by introducing AttnRL, which uses attention scores to guide branching and adaptive sampling, resulting in improved performance and efficiency on mathematical reasoning benchmarks.

Reinforcement Learning (RL) has shown remarkable success in enhancing the reasoning capabilities of Large Language Models (LLMs). Process-Supervised RL (PSRL) has emerged as a more effective paradigm compared to outcome-based RL. However, existing PSRL approaches suffer from limited exploration efficiency, both in terms of branching positions and sampling. In this paper, we introduce a novel PSRL framework (AttnRL), which enables efficient exploration for reasoning models. Motivated by preliminary observations that steps exhibiting high attention scores correlate with reasoning behaviors, we propose to branch from positions with high values. Furthermore, we develop an adaptive sampling strategy that accounts for problem difficulty and historical batch size, ensuring that the whole training batch maintains non-zero advantage values. To further improve sampling efficiency, we design a one-step off-policy training pipeline for PSRL. Extensive experiments on multiple challenging mathematical reasoning benchmarks demonstrate that our method consistently outperforms prior approaches in terms of performance and sampling and training efficiency.
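The abstract names two mechanisms that a small example can make concrete: branching from high-attention steps, and keeping only training samples whose advantages are non-zero. The sketch below is illustrative only and is not the authors' implementation. The function names, the top-k selection rule, and the mean-centred (GRPO-style) advantage are all assumptions introduced here. The final filter is a plain drop-uniform-groups rule, not the paper's adaptive sampling strategy, which also accounts for problem difficulty and historical batch size.

```python
from statistics import mean


def top_k_branch_positions(step_scores, k):
    """Return the indices of the k highest-scoring steps, in position order.

    `step_scores` is a hypothetical list holding one attention-derived
    score per reasoning step; how such a score is computed is not
    specified here.
    """
    if k <= 0:
        return []
    ranked = sorted(
        range(len(step_scores)),
        key=lambda i: step_scores[i],
        reverse=True,
    )
    return sorted(ranked[:k])


def group_advantages(rewards):
    """Mean-centred advantages for one prompt's group of rollouts.

    Assumption: a GRPO-style baseline (the group mean) is used.
    """
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def keep_informative_prompts(rewards_by_prompt):
    """Drop prompts whose rollouts all received the same reward.

    Identical rewards give all-zero mean-centred advantages, so such
    prompts contribute no gradient signal under the assumption above.
    """
    return {
        prompt: rewards
        for prompt, rewards in rewards_by_prompt.items()
        if len(set(rewards)) > 1
    }


if __name__ == "__main__":
    scores = [0.10, 0.70, 0.20, 0.90, 0.40]
    print(top_k_branch_positions(scores, k=2))

    batch = {
        "easy": [1.0, 1.0, 1.0, 1.0],   # always solved
        "mixed": [1.0, 0.0, 1.0, 0.0],  # sometimes solved
        "hard": [0.0, 0.0, 0.0, 0.0],   # never solved
    }
    for prompt, rewards in keep_informative_prompts(batch).items():
        print(prompt, group_advantages(rewards))
```

In this toy batch only the "mixed" prompt survives the filter. The "easy" and "hard" prompts have uniform rewards and therefore zero advantage for every rollout, which is the situation the abstract's adaptive sampling is designed to avoid.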

