LG · AI · Oct 21, 2021

Is High Variance Unavoidable in RL? A Case Study in Continuous Control

arXiv:2110.11222v2 · 37 citations
Originality: Incremental advance
AI Analysis

The paper addresses reproducibility and safety concerns in RL that matter to both researchers and practitioners. The advance is incremental, since it builds on existing actor-critic methods.

The paper investigates the causes of high variance in reinforcement learning experiments, focusing on continuous control from pixels. It finds that early-training variance caused by numerical instability can be reduced by normalizing penultimate features, which in turn allows larger learning rates and significantly decreases outcome variance.

Reinforcement learning (RL) experiments have notoriously high variance, and minor details can have disproportionately large effects on measured outcomes. This is problematic for creating reproducible research and also serves as an obstacle for real-world applications, where safety and predictability are paramount. In this paper, we investigate causes for this perceived instability. To allow for an in-depth analysis, we focus on a specifically popular setup with high variance -- continuous control from pixels with an actor-critic agent. In this setting, we demonstrate that variance mostly arises early in training as a result of poor "outlier" runs, but that weight initialization and initial exploration are not to blame. We show that one cause for early variance is numerical instability which leads to saturating nonlinearities. We investigate several fixes to this issue and find that one particular method is surprisingly effective and simple -- normalizing penultimate features. Addressing the learning instability allows for larger learning rates, and significantly decreases the variance of outcomes. This demonstrates that the perceived variance in RL is not necessarily inherent to the problem definition and may be addressed through simple architectural modifications.
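
The link between saturating nonlinearities and penultimate feature normalization can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not say which normalization is used, so the unit L2 norm, the tanh output layer, the layer sizes, and the helper names `l2_normalize` and `actor_head` are all assumptions made for illustration. The sketch feeds artificially large features into a final linear layer followed by tanh and compares the tanh gradient with and without normalization.

```python
import numpy as np


def l2_normalize(features, eps=1e-8):
    """Scale each row of `features` to unit L2 norm (assumed normalization)."""
    norms = np.linalg.norm(features, axis=-1, keepdims=True)
    return features / (norms + eps)


def actor_head(features, weights, normalize):
    """Hypothetical final layer: linear map followed by tanh."""
    if normalize:
        features = l2_normalize(features)
    pre_activation = features @ weights
    action = np.tanh(pre_activation)
    # d tanh(z)/dz = 1 - tanh(z)^2; values near zero mean the unit is saturated.
    tanh_grad = 1.0 - action ** 2
    return action, tanh_grad


rng = np.random.default_rng(0)
# Artificially large penultimate features, standing in for an unstable run.
features = 50.0 * rng.standard_normal((4, 64))
weights = 0.1 * rng.standard_normal((64, 2))

_, grad_raw = actor_head(features, weights, normalize=False)
_, grad_norm = actor_head(features, weights, normalize=True)

print("mean tanh gradient, raw features:       ", grad_raw.mean())
print("mean tanh gradient, normalized features:", grad_norm.mean())
```

With unit-norm features, each pre-activation is bounded by the norm of the corresponding weight column, so in this toy setting the tanh stays away from its flat regions and gradients keep flowing. Whether the same mechanism explains the paper's results in full is a matter for the paper itself, not this sketch.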

