The RL Perceptron: Generalisation Dynamics of Policy Learning in High Dimensions
This work addresses a gap in RL theory for high-dimensional settings. It offers insights for both researchers and practitioners, though it is an incremental step towards bridging theory and practice.
The authors studied the dynamics of policy learning in high-dimensional reinforcement learning by proposing a solvable model whose typical dynamics are described by closed-form ODEs. From this model they derived optimal schedules for learning rates and task difficulty, and identified phenomena such as delayed learning under sparse rewards and a speed-accuracy trade-off driven by reward stringency. Experiments on variants of the games "Bossfight" and "Pong" showed the same speed-accuracy trade-off in practice.
Reinforcement learning (RL) algorithms have proven transformative in a range of domains. To tackle real-world domains, these systems often use neural networks to learn policies directly from pixels or other high-dimensional sensory input. By contrast, much theory of RL has focused on discrete state spaces or worst-case analysis, and fundamental questions remain about the dynamics of policy learning in high-dimensional settings. Here, we propose a solvable high-dimensional model of RL that can capture a variety of learning protocols, and derive its typical dynamics as a set of closed-form ordinary differential equations (ODEs). We derive optimal schedules for the learning rates and task difficulty - analogous to annealing schemes and curricula during training in RL - and show that the model exhibits rich behaviour, including delayed learning under sparse rewards; a variety of learning regimes depending on reward baselines; and a speed-accuracy trade-off driven by reward stringency. Experiments on variants of the Procgen game "Bossfight" and Arcade Learning Environment game "Pong" also show such a speed-accuracy trade-off in practice. Together, these results take a step towards closing the gap between theory and practice in high-dimensional RL.
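To make the idea of policy learning from sparse, episode-level rewards more concrete, the following toy simulation may help. It is a minimal sketch under stated assumptions, not the authors' model or code: it assumes a teacher-student perceptron in which the student makes `episode_len` binary decisions per episode, is rewarded only if every decision matches a fixed teacher, and applies a Hebbian-style update on rewarded episodes. The function name `run_rl_perceptron`, the parameter values, and the exact update rule are all hypothetical choices made for illustration.

```python
import math
import random


def run_rl_perceptron(dim=30, episode_len=3, n_episodes=3000, lr=1.0, seed=0):
    """Toy teacher-student perceptron trained from episode-level rewards.

    Assumption (not taken from the text above): a reward is given only when
    all `episode_len` decisions in an episode agree with the teacher.
    Returns the student-teacher cosine overlap before and after training.
    """
    rng = random.Random(seed)
    teacher = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    student = [rng.gauss(0.0, 1.0) for _ in range(dim)]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def overlap():
        # Cosine of the angle between student and teacher weight vectors.
        return dot(student, teacher) / math.sqrt(
            dot(student, student) * dot(teacher, teacher)
        )

    start = overlap()
    step = lr / (episode_len * math.sqrt(dim))
    for _ in range(n_episodes):
        inputs, actions = [], []
        all_correct = True
        for _ in range(episode_len):
            x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            action = 1 if dot(student, x) >= 0 else -1
            target = 1 if dot(teacher, x) >= 0 else -1
            all_correct = all_correct and (action == target)
            inputs.append(x)
            actions.append(action)
        if all_correct:
            # Sparse reward: reinforce the actions taken in this episode.
            for action, x in zip(actions, inputs):
                for i in range(dim):
                    student[i] += step * action * x[i]
    return start, overlap()


if __name__ == "__main__":
    before, after = run_rl_perceptron()
    print(f"overlap before: {before:.3f}, after: {after:.3f}")
```

In this sketch, raising `episode_len` makes the reward condition more stringent, so rewarded episodes (and therefore updates) become rarer early in training. This is only a qualitative illustration of sparse-reward learning and should not be read as a reproduction of the ODE analysis described in the abstract.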