Meta-Gradient Reinforcement Learning with an Objective Discovered Online
This work addresses the inflexibility of hand-designed objectives in RL, offering researchers and practitioners an adaptive approach that builds incrementally on existing meta-learning methods.
Deep reinforcement learning algorithms normally require a predefined objective. The paper proposes a meta-gradient descent algorithm that instead discovers its own objective online from interactive experience, allowing the agent to adapt and learn more effectively over time. On the Atari Learning Environment, it eventually outperforms the median score of a strong actor-critic baseline.
Deep reinforcement learning includes a broad family of algorithms that parameterise an internal representation, such as a value function or policy, by a deep neural network. Each algorithm optimises its parameters with respect to an objective, such as Q-learning or policy gradient, that defines its semantics. In this work, we propose an algorithm based on meta-gradient descent that discovers its own objective, flexibly parameterised by a deep neural network, solely from interactive experience with its environment. Over time, this allows the agent to learn how to learn increasingly effectively. Furthermore, because the objective is discovered online, it can adapt to changes over time. We demonstrate that the algorithm discovers how to address several important issues in RL, such as bootstrapping, non-stationarity, and off-policy learning. On the Atari Learning Environment, the meta-gradient algorithm adapts over time to learn with greater efficiency, eventually outperforming the median score of a strong actor-critic baseline.
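The meta-gradient idea in the abstract can be illustrated with a minimal JAX sketch. It is not the paper's implementation. It assumes a linear value function `theta`, a learned target that is only a linear map `eta` over (reward, next value, bias) where the paper uses a deep network, and an outer loss that regresses onto observed returns. All function names, shapes, step sizes, and the random data are hypothetical and chosen for illustration.

```python
import jax
import jax.numpy as jnp


def values(theta, S):
    # Linear value predictions for a batch of state features S: (n, d).
    return S @ theta


def learned_targets(eta, r, v_next):
    # Hypothetical learned objective: a linear map from (reward, next value, bias)
    # to an update target. The paper parameterises this with a deep network.
    feats = jnp.stack([r, v_next, jnp.ones_like(r)], axis=-1)
    return feats @ eta


def inner_loss(theta, eta, S, r, S_next):
    # Squared error between predictions and the learned targets.
    # Bootstrapped values are treated as constants with respect to theta.
    v_next = jax.lax.stop_gradient(values(theta, S_next))
    return jnp.mean((values(theta, S) - learned_targets(eta, r, v_next)) ** 2)


def inner_update(theta, eta, train, alpha=0.1):
    # One gradient step on the agent parameters; the result depends on eta.
    return theta - alpha * jax.grad(inner_loss)(theta, eta, *train)


def outer_loss(eta, theta, train, val):
    # Standard objective evaluated after the update: regress onto returns G_val.
    S_val, G_val = val
    theta_new = inner_update(theta, eta, train)
    return jnp.mean((values(theta_new, S_val) - G_val) ** 2)


# Meta-gradient: differentiate the outer loss through the inner update w.r.t. eta.
meta_grad = jax.grad(outer_loss)

# Synthetic data standing in for interactive experience (assumption).
k1, k2, k3, k4, k5 = jax.random.split(jax.random.PRNGKey(0), 5)
n, d = 16, 4
train = (jax.random.normal(k1, (n, d)),   # state features
         jax.random.normal(k2, (n,)),     # rewards
         jax.random.normal(k3, (n, d)))   # next-state features
val = (jax.random.normal(k4, (n, d)),     # held-out state features
       jax.random.normal(k5, (n,)))       # observed returns

theta = 0.1 * jnp.ones(d)
eta = jnp.array([1.0, 0.9, 0.0])  # starts as a one-step TD target: r + 0.9 * v_next

beta = 0.1  # meta learning rate (illustrative value)
theta_new = inner_update(theta, eta, train)               # agent update
eta_new = eta - beta * meta_grad(eta, theta, train, val)  # objective update
```

The key design point is that `inner_update` is itself differentiable, so `jax.grad` can trace how a change in `eta` alters the updated parameters and therefore the outer loss. Running both updates repeatedly on fresh experience would give the online character described above, under these simplifying assumptions.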