Not All Tokens Learn Alike: Attention Entropy Reveals Heterogeneous Signals in RL Reasoning
For researchers working on RL-based reasoning post-training of LLMs, this work reveals that token-level learning signals are heterogeneous and that uniform token averaging may obscure optimization-relevant structure.
The paper studies token-level learning signals in RL-based reasoning post-training for LLMs, finding that low-attention-entropy tokens (anchors) provide stable gradients while high-attention-entropy tokens (explorers) offer volatile but potentially useful signals. In the strongest setting, a dynamic entropy-aware reweighting method raises the held-out average of Qwen3-8B-Base from 34.39 to 37.40.
Reinforcement-learning-based post-training has become a key approach for improving the reasoning ability of large language models, but its token-level learning signals remain poorly understood. This work studies their heterogeneity through attention entropy, which measures how concentrated or diffuse the contextual support is for each response token. We first show that token-level RL objectives are sparsely estimable: uniformly random subsets containing 20% of the tokens preserve much of the full-token held-out performance, suggesting substantial redundancy in token-level updates. However, entropy-structured subsets behave very differently. Low-attention-entropy tokens, which we call anchors, rely on concentrated support, produce stable gradients aligned with full-token updates, and provide a reliable optimization backbone, but tend to plateau on harder benchmarks. High-attention-entropy tokens, which we call explorers, aggregate more diffuse context and induce larger but more volatile gradients. Explorer-only training is unstable on average, though rare successful runs suggest that these tokens may contain useful hard-reasoning signals when optimization remains stable. We support this anchor-explorer spectrum with evidence-gathering analyses, entropy dynamics, gradient-geometry diagnostics, and controls showing that position, predictive entropy, and loss normalization do not explain the observed asymmetry. Finally, a dynamic entropy-aware soft-reweighting intervention raises the held-out average of Qwen3-8B-Base from 34.39 to 37.40 in the strongest setting. These findings suggest that attention entropy reveals optimization-relevant structure in token-level RL signals, and that uniform token averaging can obscure meaningful heterogeneity in reasoning post-training.
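
To make the anchor-explorer vocabulary concrete, the following is a minimal sketch of how a per-token attention entropy could be computed and used to select or reweight tokens. It is not the paper's implementation: the abstract does not give the exact definition, so the sketch assumes Shannon entropy of each token's attention distribution averaged over heads. The function names (`attention_entropy`, `split_anchor_explorer`, `soft_weights`), the 20% selection fraction for entropy-structured subsets, and the exponential weighting with parameter `alpha` are all hypothetical choices made for illustration.

```python
import numpy as np


def attention_entropy(attn):
    """Per-token attention entropy (assumed definition, not from the paper).

    attn: array of shape (heads, T, T); attn[h, t] is the attention
    distribution of token t over context positions and sums to 1.
    Returns an array of shape (T,), averaged over heads.
    """
    eps = 1e-12
    p = np.clip(attn, eps, 1.0)               # avoid log(0); 0 * log(eps) = 0
    per_head = -(attn * np.log(p)).sum(axis=-1)  # shape (heads, T)
    return per_head.mean(axis=0)


def split_anchor_explorer(entropy, frac=0.2):
    """Return indices of the lowest- and highest-entropy tokens.

    frac is a hypothetical selection fraction.
    """
    k = max(1, int(round(frac * entropy.shape[0])))
    order = np.argsort(entropy)                # ascending entropy
    return order[:k], order[-k:]               # (anchors, explorers)


def soft_weights(entropy, alpha=1.0):
    """Hypothetical soft reweighting with mean weight 1.

    alpha > 0 up-weights low-entropy tokens, alpha < 0 up-weights
    high-entropy tokens; the direction used in the paper is not
    stated in the abstract.
    """
    z = (entropy - entropy.mean()) / (entropy.std() + 1e-8)
    w = np.exp(-alpha * z)
    return w * (w.size / w.sum())


# Toy example: 2 identical heads, 4 tokens with increasingly diffuse attention.
rows = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [1 / 3, 1 / 3, 1 / 3, 0.0],
    [0.25, 0.25, 0.25, 0.25],
])
attn = np.stack([rows, rows])
entropy = attention_entropy(attn)
anchors, explorers = split_anchor_explorer(entropy, frac=0.25)
weights = soft_weights(entropy, alpha=1.0)
```

In this toy example the first token attends to a single position and has zero entropy, so it is selected as the anchor, while the last token spreads attention uniformly and is selected as the explorer. In an RL objective, `weights` would multiply the per-token loss terms in place of a uniform average; making the scheme dynamic, for instance by changing `alpha` over training, is one possible reading of the abstract and is likewise an assumption.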