LG Jun 5, 2022

Learning Dynamics and Generalization in Reinforcement Learning

arXiv:2206.02126v1, 17 citations, h-index: 64
Originality: Incremental advance
AI Analysis

This paper addresses the generalization problem in reinforcement learning, but the advance is incremental: it builds on existing TD methods and distillation techniques.

The paper analyzes how temporal difference (TD) learning encourages agents to fit non-smooth components of the value function early in training, which weakens generalization between states relative to policy gradient methods. It also shows that post-training policy distillation improves generalization to novel ProcGen environments and robustness to input perturbations.
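One way to make "generalization between states" concrete is to ask how far a single TD update on one state moves the value prediction at other states. The following is a minimal sketch, assuming a linear value function and hand-picked feature vectors; the function names, features, and step sizes are illustrative assumptions, not the paper's setup, which studies deep networks.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def td0_update(w, phi_s, reward, phi_next, alpha=0.1, gamma=0.9):
    """One semi-gradient TD(0) step for a linear value function V(s) = w . phi(s).

    Returns the updated weights and the TD error.
    """
    td_error = reward + gamma * dot(w, phi_next) - dot(w, phi_s)
    new_w = [wi + alpha * td_error * fi for wi, fi in zip(w, phi_s)]
    return new_w, td_error


# Hypothetical features: s0 and s1 share nothing, s2 overlaps with both.
phi = {"s0": [1.0, 0.0], "s1": [0.0, 1.0], "s2": [1.0, 1.0]}
w = [0.0, 0.0]

# Update on the transition s0 -> s1 with reward 1.
new_w, delta = td0_update(w, phi["s0"], reward=1.0, phi_next=phi["s1"])

# How much did the prediction at every state change after updating on s0 only?
change = {s: dot(new_w, f) - dot(w, f) for s, f in phi.items()}
```

In this linear case the change at another state s' is proportional to the feature overlap phi(s) . phi(s'): the update on s0 leaves s1 untouched and moves s2 by the same amount as s0. A representation whose features overlap less therefore generalizes less between states.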

Solving a reinforcement learning (RL) problem poses two competing challenges: fitting a potentially discontinuous value function, and generalizing well to new observations. In this paper, we analyze the learning dynamics of temporal difference algorithms to gain novel insight into the tension between these two objectives. We show theoretically that temporal difference learning encourages agents to fit non-smooth components of the value function early in training, and at the same time induces the second-order effect of discouraging generalization. We corroborate these findings in deep RL agents trained on a range of environments, finding that neural networks trained using temporal difference algorithms on dense reward tasks exhibit weaker generalization between states than randomly initialized networks and networks trained with policy gradient methods. Finally, we investigate how post-training policy distillation may avoid this pitfall, and show that this approach improves generalization to novel environments in the ProcGen suite and improves robustness to input perturbations.
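The abstract does not state the distillation objective. As a rough sketch of what post-training policy distillation can look like, the snippet below assumes a KL divergence between the trained (teacher) policy and a freshly initialized (student) policy, averaged over a batch of states. The function names and the example logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of action logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def distillation_loss(teacher_logits, student_logits):
    """Mean KL(teacher || student) over a batch of states.

    Each argument is a list of per-state action-logit lists.
    """
    total = 0.0
    for t_logits, s_logits in zip(teacher_logits, student_logits):
        log_p = log_softmax(t_logits)  # teacher log-probabilities
        log_q = log_softmax(s_logits)  # student log-probabilities
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(teacher_logits)


# Hypothetical logits for two states and three actions.
teacher = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
untrained_student = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

loss_match = distillation_loss(teacher, teacher)
loss_mismatch = distillation_loss(teacher, untrained_student)
```

The loss is zero when the student reproduces the teacher's action distribution on every state and positive otherwise. Under this assumption, the student is trained by minimizing it on states visited by the teacher.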

