Dynamic Learning Rate for Deep Reinforcement Learning: A Bandit Approach
This paper addresses the challenge of non-stationary objectives in deep RL, a problem of direct interest to practitioners, though the contribution is incremental because it builds on existing meta-learning and bandit methods.
The paper tackles the problem of learning rate selection in deep reinforcement learning by introducing LRRL, a meta-learning approach that dynamically chooses learning rates based on policy performance, achieving competitive or superior results on Atari and MuJoCo benchmarks.
In deep Reinforcement Learning (RL), the learning rate critically influences both stability and performance, yet its optimal value shifts during training as the environment and policy evolve. Standard decay schedulers assume monotonic convergence and often misalign with these dynamics, leading to premature or delayed adjustments. We introduce LRRL, a meta-learning approach that dynamically selects the learning rate based on policy performance rather than training steps. LRRL adaptively favors rates that improve returns, remaining robust even when the candidate set includes values that individually cause divergence. Across Atari and MuJoCo benchmarks, LRRL achieves performance competitive with or superior to tuned baselines and standard schedulers. Our findings position LRRL as a practical solution for adapting to non-stationary objectives in deep RL.
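To make the selection mechanism concrete, the following is a minimal sketch of a bandit that picks a learning rate from a fixed candidate set and reweights the candidates according to observed changes in return. The abstract does not specify the bandit algorithm, the feedback signal, or any hyperparameters, so everything here is an assumption for illustration: the sketch uses an Exp3-style update, treats the feedback as a change in average return scaled to roughly [-1, 1], and the names `Exp3LearningRateSelector`, `step_size`, `exploration`, and `toy_training_loop` are hypothetical. It should not be read as the authors' implementation.

```python
import math
import random


class Exp3LearningRateSelector:
    """Exp3-style bandit over candidate learning rates (illustrative sketch)."""

    def __init__(self, learning_rates, step_size=0.2, exploration=0.1, seed=0):
        self.learning_rates = list(learning_rates)
        self.step_size = step_size      # assumed bandit step size
        self.exploration = exploration  # assumed uniform-exploration mix
        self.weights = [0.0] * len(self.learning_rates)
        self.rng = random.Random(seed)
        self.last_arm = None

    def probabilities(self):
        # Softmax over weights (shifted by the max for numerical stability),
        # mixed with a uniform distribution so every arm keeps some mass.
        top = max(self.weights)
        exps = [math.exp(w - top) for w in self.weights]
        total = sum(exps)
        k = len(self.weights)
        return [(1.0 - self.exploration) * e / total + self.exploration / k
                for e in exps]

    def select(self):
        # Sample an arm and return the learning rate for the next round.
        probs = self.probabilities()
        self.last_arm = self.rng.choices(range(len(probs)), weights=probs)[0]
        return self.learning_rates[self.last_arm]

    def update(self, feedback):
        # Importance-weighted update: only the arm that was pulled changes.
        probs = self.probabilities()
        self.weights[self.last_arm] += (
            self.step_size * feedback / probs[self.last_arm]
        )


def toy_training_loop(rounds=200):
    """Toy stand-in for RL training: each rate has a fixed, made-up effect."""
    # Hypothetical effects: the largest rate "diverges" (returns get worse),
    # the middle rate improves returns, the smallest rate barely moves them.
    return_change = {1e-1: -1.0, 1e-3: 1.0, 1e-5: 0.0}
    selector = Exp3LearningRateSelector(list(return_change))
    for _ in range(rounds):
        lr = selector.select()
        # In a real agent, this is where the policy would be trained with
        # `lr` for some interval and the change in average return measured.
        selector.update(return_change[lr])
    return selector.probabilities()
```

In this toy setting the selector shifts probability toward the rate that improves returns and away from the one that degrades them, which mirrors the abstract's claim of robustness to candidates that individually cause divergence. A real agent would replace the fixed `return_change` table with measured returns and would need some normalization of that signal.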