LGMLFeb 25, 2021

No-Regret Reinforcement Learning with Heavy-Tailed Rewards

arXiv:2102.12769v115 citations
Originality Incremental advance
AI Analysis

This work addresses a practical issue for reinforcement learning practitioners dealing with non-standard reward distributions, though it is incremental as it builds on existing methods like UCRL2 and Q-Learning.

The paper tackles the problem of reinforcement learning with heavy-tailed rewards, which are common in real-world systems but not typically addressed by existing algorithms, and shows that their proposed algorithms achieve near-optimal regret bounds and outperform baselines on synthetic and standard benchmarks.

Reinforcement learning algorithms typically assume rewards to be sampled from light-tailed distributions, such as Gaussian or bounded. However, a wide variety of real-world systems generate rewards that follow heavy-tailed distributions. We consider such scenarios in the setting of undiscounted reinforcement learning. By constructing a lower bound, we show that the difficulty of learning heavy-tailed rewards asymptotically dominates the difficulty of learning transition probabilities. Leveraging techniques from robust mean estimation, we propose Heavy-UCRL2 and Heavy-Q-Learning, and show that they achieve near-optimal regret bounds in this setting. Our algorithms also naturally generalize to deep reinforcement learning applications; we instantiate Heavy-DQN as an example of this. We demonstrate that all of our algorithms outperform baselines on both synthetic MDPs and standard RL benchmarks.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes