LG · Jan 12

Stagewise Reinforcement Learning and the Geometry of the Regret Landscape

Chris Elliott, Einar Urdshals, David Quarel, Matthew Farrugia-Roberts, Daniel Murfet

arXiv:2601.07524v1 · 5.83 citations · h-index: 7

Originality: Incremental advance

AI Analysis

This work addresses the theoretical understanding of learning dynamics in reinforcement learning and provides a geometric framework for analyzing transitions between policies. It is an incremental advance, since it builds on existing singular learning theory.

The paper extends singular learning theory to deep reinforcement learning. It shows that Bayesian phase transitions in policy learning are governed by the local learning coefficient (LLC), which predicts transitions from simple, high-regret policies to complex, low-regret ones. Experiments in a gridworld environment support this prediction: across transitions, regret decreases sharply while the LLC increases.
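As a rough illustration of why the LLC orders these transitions, the derivation below uses the leading-order free-energy asymptotics of singular learning theory, assumed here to carry over with regret in place of loss. The notation is hypothetical and not taken from the paper: W_1 and W_2 are neighborhoods of two candidate policies, R_1 > R_2 are their regrets, λ_1 < λ_2 are their local learning coefficients, and lower-order terms are dropped.

```latex
\begin{aligned}
F_n(W_\alpha) &\approx n\,R_\alpha + \lambda_\alpha \log n, \qquad \alpha \in \{1, 2\},\\
F_n(W_2) - F_n(W_1) &\approx -\,n\,(R_1 - R_2) + (\lambda_2 - \lambda_1)\log n,\\
F_n(W_2) < F_n(W_1) &\iff \frac{n}{\log n} > \frac{\lambda_2 - \lambda_1}{R_1 - R_2}.
\end{aligned}
```

Because n / log n grows with n (for n > e), the simpler, higher-regret phase 1 has lower free energy and therefore dominates the posterior at small n. The more complex, lower-regret phase 2 takes over once n passes the threshold on the right-hand side, which is the simple-to-complex ordering described above.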

Singular learning theory characterizes Bayesian learning as an evolving tradeoff between accuracy and complexity, with transitions between qualitatively different solutions as sample size increases. We extend this theory to deep reinforcement learning, proving that the concentration of the generalized posterior over policies is governed by the local learning coefficient (LLC), an invariant of the geometry of the regret function. This theory predicts that Bayesian phase transitions in reinforcement learning should proceed from simple policies with high regret to complex policies with low regret. We verify this prediction empirically in a gridworld environment exhibiting stagewise policy development: phase transitions over SGD training manifest as "opposing staircases" where regret decreases sharply while the LLC increases. Notably, the LLC detects phase transitions even when estimated on a subset of states where the policies appear identical in terms of regret, suggesting it captures changes in the underlying algorithm rather than just performance.
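The LLC is usually estimated numerically rather than computed in closed form. The sketch below assumes the sampling-based estimator common in the singular learning theory literature, λ̂ = nβ(E_β[L(w)] - L(w*)) with β = 1/log n, where the expectation is over a posterior tempered by β and localized around w* by a Gaussian term of strength γ. It is not the authors' code. The toy losses `regular_loss` and `singular_loss` stand in for the paper's regret function, every name is hypothetical, and sampling uses full-batch Langevin dynamics rather than minibatch SGLD.

```python
import numpy as np


def estimate_llc(grad_loss, loss, w_star, n, gamma=10.0, eps=1e-4,
                 num_chains=256, num_steps=10000, burn_in=4000, seed=0):
    """Hypothetical LLC estimator: n * beta * (E[L(w)] - L(w_star)), beta = 1/log(n).

    The expectation is over the localized, tempered posterior
        p(w) proportional to exp(-n * beta * L(w) - gamma / 2 * ||w - w_star||^2),
    sampled with full-batch (unadjusted) Langevin dynamics in parallel chains.
    """
    rng = np.random.default_rng(seed)
    w_star = np.asarray(w_star, dtype=float)
    beta = 1.0 / np.log(n)
    w = np.tile(w_star, (num_chains, 1))  # every chain starts at w_star
    running_losses = []
    for step in range(num_steps):
        # Gradient of the negative log-density of the localized posterior.
        drift = n * beta * grad_loss(w) + gamma * (w - w_star)
        # Langevin step: half-step along -drift plus Gaussian noise of variance eps.
        w = w - 0.5 * eps * drift + np.sqrt(eps) * rng.standard_normal(w.shape)
        if step >= burn_in:
            running_losses.append(loss(w).mean())  # average over chains
    expected_loss = float(np.mean(running_losses))
    return n * beta * (expected_loss - float(loss(w_star[None, :])[0]))


# Toy stand-ins for a regret function, both minimized at w* = 0 (hypothetical).
def regular_loss(w):
    return np.sum(w ** 2, axis=-1)  # non-degenerate minimum, LLC = d/2 = 1


def regular_grad(w):
    return 2.0 * w


def singular_loss(w):
    return (w[..., 0] * w[..., 1]) ** 2  # degenerate minimum along both axes, LLC = 1/2


def singular_grad(w):
    w1, w2 = w[..., 0], w[..., 1]
    return np.stack([2.0 * w1 * w2 ** 2, 2.0 * w1 ** 2 * w2], axis=-1)


if __name__ == "__main__":
    origin = np.zeros(2)
    print("regular LLC estimate: ", estimate_llc(regular_grad, regular_loss, origin, n=1000))
    print("singular LLC estimate:", estimate_llc(singular_grad, singular_loss, origin, n=1000))
```

A regular quadratic in two dimensions has LLC d/2 = 1, so the first estimate should land near 1. The singular loss (w1·w2)^2 has asymptotic LLC 1/2, and its estimate should come out well below the regular one, although at this small n and strong localization it need not be close to 1/2. Treat the numbers as relative measures of degeneracy, in the spirit of the LLC curves compared across training in the abstract.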

