CL · LG · Feb 2

Restoring Exploration after Post-Training: Latent Exploration Decoding for Large Reasoning Models

arXiv:2602.01698v1 · 2 citations · h-index: 29 · Has Code
Originality: Incremental advance
AI Analysis

The work addresses a specific bottleneck for researchers and practitioners who use post-trained reasoning models: it restores exploration at decoding time without retraining. The advance is incremental because it builds on existing methods.

The paper tackles exploration collapse in Large Reasoning Models (LRMs) after RL post-training, where temperature-based sampling no longer improves accuracy. It proposes Latent Exploration Decoding (LED), a depth-conditioned decoding strategy that restores exploration by exploiting the higher entropy of intermediate layers. LED improves pass@1 and pass@16 accuracy by 0.61 and 1.03 percentage points, respectively, across benchmarks.
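For readers unfamiliar with the pass@1 and pass@16 metrics quoted above, the sketch below shows one common way to estimate pass@k from n sampled completions, of which c are correct. It uses the unbiased estimator popularized in code-generation evaluation; whether the paper computes its numbers this way is an assumption, and the function name `pass_at_k` is illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    a random subset of k samples contains no correct completion.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 16 samples and 4 correct, `pass_at_k(16, 4, 1)` reduces to the plain fraction of correct samples, while `pass_at_k(16, 4, 16)` is 1.0 because any set of all 16 samples contains a correct one.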

Large Reasoning Models (LRMs) have recently achieved strong mathematical and code reasoning performance through Reinforcement Learning (RL) post-training. However, we show that modern reasoning post-training induces an unintended exploration collapse: temperature-based sampling no longer increases pass@$n$ accuracy. Empirically, the final-layer posterior of post-trained LRMs exhibits sharply reduced entropy, while the entropy of intermediate layers remains relatively high. Motivated by this entropy asymmetry, we propose Latent Exploration Decoding (LED), a depth-conditioned decoding strategy. LED aggregates intermediate posteriors via cumulative sum and selects depth configurations with maximal entropy as exploration candidates. Without additional training or parameters, LED consistently improves pass@1 and pass@16 accuracy by 0.61 and 1.03 percentage points across multiple reasoning benchmarks and models. Project page: https://GitHub.com/Xiaomi-Research/LED.
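The aggregation step described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes per-layer next-token logits are already available (for example by applying the unembedding to intermediate hidden states), that the cumulative sum runs from the shallowest layer to the deepest, and that each running sum is renormalized into a distribution. The names `led_candidate`, `layer_logits`, and the helper functions are hypothetical.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def entropy(probs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Shannon entropy (in nats) over the last axis."""
    return -(probs * np.log(probs + eps)).sum(axis=-1)


def led_candidate(layer_logits: np.ndarray):
    """Pick the depth whose aggregated posterior has maximal entropy.

    layer_logits: array of shape (num_layers, vocab_size), assumed to hold
    one next-token logit vector per layer (hypothetical input format).
    Returns (best_depth, aggregated_posterior, entropies_per_depth).
    """
    posteriors = softmax(layer_logits)                  # (L, V)
    running_sum = np.cumsum(posteriors, axis=0)         # cumulative sum over depth
    aggregated = running_sum / running_sum.sum(axis=-1, keepdims=True)
    entropies = entropy(aggregated)                     # (L,)
    best_depth = int(np.argmax(entropies))
    return best_depth, aggregated[best_depth], entropies
```

A token could then be sampled from the returned distribution, for instance with `np.random.default_rng().choice(vocab_size, p=dist)`. How the paper combines this exploration candidate with the final-layer posterior is not stated in the abstract, so that step is left out here.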

