LG · AI · CV · Jul 3, 2022

Stabilizing Off-Policy Deep Reinforcement Learning from Pixels

arXiv:2207.00986v1 · 44 citations · h-index: 12
Originality: Highly original
AI Analysis

This paper addresses a critical stability problem for deep reinforcement learning researchers and practitioners, and its solution is novel rather than incremental.

The paper tackles the instability of off-policy reinforcement learning from pixel observations. It identifies a new 'visual deadly triad' that causes catastrophic self-overfitting and proposes A-LIX, which outperforms the prior state-of-the-art on the DeepMind Control and Atari 100k benchmarks without data augmentation or auxiliary losses.

Off-policy reinforcement learning (RL) from pixel observations is notoriously unstable. As a result, many successful algorithms must combine different domain-specific practices and auxiliary losses to learn meaningful behaviors in complex environments. In this work, we provide novel analysis demonstrating that these instabilities arise from performing temporal-difference learning with a convolutional encoder and low-magnitude rewards. We show that this new visual deadly triad causes unstable training and premature convergence to degenerate solutions, a phenomenon we name catastrophic self-overfitting. Based on our analysis, we propose A-LIX, a method providing adaptive regularization to the encoder's gradients that explicitly prevents the occurrence of catastrophic self-overfitting using a dual objective. By applying A-LIX, we significantly outperform the prior state-of-the-art on the DeepMind Control and Atari 100k benchmarks without any data augmentation or auxiliary losses.
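The abstract states that A-LIX adapts its regularization through a dual objective but gives no details. As a rough illustration of that general idea only, the sketch below shows how a dual variable can tune a regularization strength until a measured quantity reaches a target. Everything here is an assumption made for illustration: the names `dual_step`, `toy_smoothness`, and `adapt_strength`, the target value, and the toy response curve are hypothetical and are not taken from the paper or its code.

```python
import math


def dual_step(log_alpha, measured, target, lr=1.0):
    """One gradient-descent step on a hypothetical dual loss.

    Dual loss (assumed): L(log_alpha) = exp(log_alpha) * (measured - target).
    When the measured quantity is below the target, the gradient is negative,
    so log_alpha (and hence the regularization strength) increases.
    """
    alpha = math.exp(log_alpha)
    grad = alpha * (measured - target)
    return log_alpha - lr * grad


def toy_smoothness(alpha):
    """Toy stand-in for a measured statistic of the encoder's gradients.

    Assumed to grow monotonically with the regularization strength alpha.
    A real system would compute this from the network, not from a formula.
    """
    return alpha / (1.0 + alpha)


def adapt_strength(target=0.5, steps=500, init_alpha=0.1, lr=1.0):
    """Repeat the dual update until the toy statistic settles at the target."""
    log_alpha = math.log(init_alpha)
    for _ in range(steps):
        measured = toy_smoothness(math.exp(log_alpha))
        log_alpha = dual_step(log_alpha, measured, target, lr)
    return math.exp(log_alpha)


if __name__ == "__main__":
    alpha = adapt_strength()
    print(f"adapted strength: {alpha:.4f}, statistic: {toy_smoothness(alpha):.4f}")
```

Parameterizing the strength as `exp(log_alpha)` keeps it positive without clipping, a common choice for dual variables. Whether A-LIX uses this exact parameterization is not stated in the abstract.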

Code Implementations: 1 repo