AI · CL · LG · ML · Sep 26, 2025

Benefits and Pitfalls of Reinforcement Learning for Language Model Planning: A Theoretical Perspective

arXiv:2509.22613v1 · 4 citations · h-index: 18
Originality: Incremental advance
AI Analysis

This work addresses the theoretical underpinnings of RL for LLM planning. It is rated incremental because it builds on existing methods to analyze and mitigate known bottlenecks.

The paper investigates the theoretical benefits and limitations of reinforcement learning (RL) for language model planning. It finds that RL improves generalization through exploration, but that policy gradient methods suffer from diversity collapse, whereas Q-learning offers off-policy learning and preserves output diversity.

Recent reinforcement learning (RL) methods have substantially enhanced the planning capabilities of Large Language Models (LLMs), yet the theoretical basis for their effectiveness remains elusive. In this work, we investigate RL's benefits and limitations through a tractable graph-based abstraction, focusing on policy gradient (PG) and Q-learning methods. Our theoretical analyses reveal that supervised fine-tuning (SFT) may introduce co-occurrence-based spurious solutions, whereas RL achieves correct planning primarily through exploration, underscoring exploration's role in enabling better generalization. However, we also show that PG suffers from diversity collapse, where output diversity decreases during training and persists even after perfect accuracy is attained. By contrast, Q-learning provides two key advantages: off-policy learning and diversity preservation at convergence. We further demonstrate that careful reward design is necessary to prevent reward hacking in Q-learning. Finally, applying our framework to the real-world planning benchmark Blocksworld, we confirm that these behaviors manifest in practice.
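To make the graph-based abstraction more concrete, the following is a minimal sketch of planning as path-finding with tabular Q-learning. It is not the paper's construction: the six-node graph, the uniform random behavior policy, the reward of 1 for reaching the target, and the names `GRAPH`, `q_learning`, and `greedy_next_nodes` are all assumptions made for illustration. It loosely shows the two properties the abstract attributes to Q-learning: the values are learned from off-policy (random) trajectories, and every first step that leads to the target keeps the same value at convergence, so no valid path is discarded.

```python
import random

# Hypothetical toy graph (an assumption, not taken from the paper):
# node -> list of successor nodes. Node 3 is the target; node 5 is a dead end.
GRAPH = {0: [1, 2, 4], 1: [3], 2: [3], 4: [5], 3: [], 5: []}
SOURCE, TARGET = 0, 3


def q_learning(graph, source, target, episodes=2000, alpha=0.5, gamma=1.0, seed=0):
    """Tabular Q-learning over edges, trained on uniformly random walks."""
    rng = random.Random(seed)
    # One Q-value per edge (current node, next node).
    q = {(u, v): 0.0 for u, nbrs in graph.items() for v in nbrs}
    for _ in range(episodes):
        node = source
        # Walk until the target or a node without successors is reached.
        while node != target and graph[node]:
            # Behavior policy is uniform random, so the data is off-policy
            # with respect to the greedy policy that Q-learning evaluates.
            nxt = rng.choice(graph[node])
            # Assumed reward design: 1 for stepping onto the target, else 0.
            reward = 1.0 if nxt == target else 0.0
            if nxt == target:
                future = 0.0
            else:
                future = max((q[(nxt, w)] for w in graph[nxt]), default=0.0)
            q[(node, nxt)] += alpha * (reward + gamma * future - q[(node, nxt)])
            node = nxt
    return q


def greedy_next_nodes(q, graph, node, tol=1e-6):
    """Return every successor whose Q-value ties for the maximum."""
    best = max(q[(node, v)] for v in graph[node])
    return [v for v in graph[node] if best - q[(node, v)] <= tol]


if __name__ == "__main__":
    q_values = q_learning(GRAPH, SOURCE, TARGET)
    print(greedy_next_nodes(q_values, GRAPH, SOURCE))
```

Under these assumptions, both successors of the source that lead to the target (nodes 1 and 2) tie for the maximum Q-value, while the dead-end branch through node 4 stays at zero. A policy gradient method trained on the same graph could instead concentrate probability on a single path, which is the diversity collapse the abstract describes; that behavior is not reproduced in this sketch.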

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
