Averaging $n$-step Returns Reduces Variance in Reinforcement Learning
This work addresses a key bottleneck for sample efficiency in reinforcement learning: the variance of multistep returns. The contribution is incremental in that it builds on existing multistep methods.
The paper tackles the high variance of multistep returns in reinforcement learning, which limits how far ahead such returns can usefully look. It shows that compound returns (weighted averages of $n$-step returns) reduce this variance, which often improves the sample efficiency of agents like DQN and PPO.
Multistep returns, such as $n$-step returns and $\lambda$-returns, are commonly used to improve the sample efficiency of reinforcement learning (RL) methods. The variance of the multistep returns becomes the limiting factor in their length; looking too far into the future increases variance and reverses the benefits of multistep learning. In our work, we demonstrate the ability of compound returns -- weighted averages of $n$-step returns -- to reduce variance. We prove for the first time that any compound return with the same contraction modulus as a given $n$-step return has strictly lower variance. We additionally prove that this variance-reduction property improves the finite-sample complexity of temporal-difference learning under linear function approximation. Because general compound returns can be expensive to implement, we introduce two-bootstrap returns, which reduce variance while remaining efficient, even when using minibatched experience replay. We conduct experiments showing that compound returns often increase the sample efficiency of $n$-step deep RL agents like DQN and PPO.
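
To make the idea of matching contraction moduli concrete, the following is a minimal sketch and not the paper's implementation. It assumes the standard $n$-step return $G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n V(s_{t+n})$, that an $n$-step return has contraction modulus $\gamma^n$, and that a compound return with weights $w_i$ on $n_i$-step returns has modulus $\sum_i w_i \gamma^{n_i}$. Under those assumptions, a two-bootstrap return mixing an $n_1$-step and an $n_2$-step return ($n_1 < n < n_2$) matches the modulus of the $n$-step return when the weight on $n_2$ is $w = (\gamma^{n_1} - \gamma^n) / (\gamma^{n_1} - \gamma^{n_2})$. The function names and the toy trajectory are hypothetical, chosen only for illustration.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return from time t: n discounted rewards plus a bootstrapped value.

    rewards[i] is the reward received after step i; values[i] is the current
    value estimate V(s_i). Both are plain Python sequences (an assumption of
    this sketch, not a requirement of the method).
    """
    discounted = sum(gamma**k * rewards[t + k] for k in range(n))
    return discounted + gamma**n * values[t + n]


def two_bootstrap_weight(n, n1, n2, gamma):
    """Weight on the n2-step return so the mixture has modulus gamma**n.

    Solves (1 - w) * gamma**n1 + w * gamma**n2 == gamma**n for w.
    """
    if not (n1 < n < n2):
        raise ValueError("expected n1 < n < n2")
    if not (0.0 < gamma < 1.0):
        raise ValueError("expected 0 < gamma < 1")
    return (gamma**n1 - gamma**n) / (gamma**n1 - gamma**n2)


def two_bootstrap_return(rewards, values, t, n1, n2, w, gamma):
    """Weighted average of an n1-step and an n2-step return."""
    short = n_step_return(rewards, values, t, n1, gamma)
    long = n_step_return(rewards, values, t, n2, gamma)
    return (1.0 - w) * short + w * long


# Toy, deterministic trajectory (hypothetical data): reward 1 everywhere, V = 0.
gamma = 0.5
rewards = [1.0] * 6
values = [0.0] * 7

w = two_bootstrap_weight(n=2, n1=1, n2=3, gamma=gamma)
modulus = (1.0 - w) * gamma**1 + w * gamma**3  # should equal gamma**2
g_n = n_step_return(rewards, values, t=0, n=2, gamma=gamma)
g_mix = two_bootstrap_return(rewards, values, t=0, n1=1, n2=3, w=w, gamma=gamma)
```

Because this toy trajectory is deterministic, it only checks that the weight matches the contraction modulus. It does not exhibit the variance reduction itself, which concerns returns computed from random trajectories.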