Return-based Scaling: Yet Another Normalisation Trick for Deep RL
The paper addresses a mundane but irritating problem for RL practitioners: it improves learning speed and stability without manual tuning. The contribution is incremental, however, as it builds on existing normalisation techniques.
The paper tackles scaling issues in deep reinforcement learning, where error scales vary across domains, tasks, and stages of learning. It proposes a return-based scaling method that requires no tuning, clipping, or adaptation, and validates it on the suite of Atari games. The method is particularly effective at mitigating interference when a shared neural network is trained on multiple targets that differ in reward scale or discounting.
Scaling issues are mundane yet irritating for practitioners of reinforcement learning. Error scales vary across domains, tasks, and stages of learning, sometimes by many orders of magnitude. This can be detrimental to learning speed and stability, create interference between learning tasks, and necessitate substantial tuning. We revisit this topic for agents based on temporal-difference learning, sketch out some desiderata, and investigate scenarios where simple fixes fall short. The mechanism we propose requires neither tuning, clipping, nor adaptation. We validate its effectiveness and robustness on the suite of Atari games. Our scaling method turns out to be particularly helpful at mitigating interference when training a shared neural network on multiple targets that differ in reward scale or discounting.
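To make the scale problem concrete, the following is a minimal sketch of dividing temporal-difference errors by a statistic of the returns. The abstract does not spell out the paper's formula, so the scale used here (the standard deviation of Monte Carlo returns) is a simplified stand-in and an assumption, as are the helper names `discounted_returns`, `td_errors`, and `return_scale` and the toy reward sequence. It only illustrates why a return-based scale can make errors comparable across tasks whose rewards differ by orders of magnitude.

```python
import statistics


def discounted_returns(rewards, gamma):
    """Monte Carlo returns G_t = r_t + gamma * G_{t+1} for one episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]


def td_errors(rewards, values, gamma):
    """One-step TD errors; `values` has len(rewards) + 1 entries (last is terminal)."""
    return [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]


def return_scale(returns, eps=1e-8):
    """Hypothetical scale (an assumption, not the paper's formula):
    population std of the returns, floored to avoid division by zero."""
    return max(statistics.pstdev(returns), eps)


base_rewards = [0.0, 1.0, 0.0, 2.0, 1.0]  # toy episode, illustrative only
gamma = 0.9

scaled = {}
for name, reward_scale in [("small", 1.0), ("large", 1000.0)]:
    rewards = [reward_scale * r for r in base_rewards]
    values = [0.0] * (len(rewards) + 1)  # untrained value estimates
    deltas = td_errors(rewards, values, gamma)
    sigma = return_scale(discounted_returns(rewards, gamma))
    # Raw errors differ by a factor of 1000 between the two tasks;
    # after division by sigma they coincide up to floating-point error.
    scaled[name] = [d / sigma for d in deltas]
```

In this toy setting the raw TD errors of the two tasks differ by three orders of magnitude, while the scaled errors are the same, so neither task would dominate the gradient of a shared network. A practical agent would need running estimates of the statistics rather than a single episode.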