LG · OC · ML · Jun 20, 2023

Provably Robust Temporal Difference Learning for Heavy-Tailed Rewards

arXiv:2306.11455v1 · 8 citations · h-index: 36
Originality: Incremental advance
AI Analysis

This paper addresses a critical issue in reinforcement learning for applications with heavy-tailed rewards and offers provable robustness. The advance is incremental: it builds on existing TD learning and adds a new clipping mechanism.

The paper tackles reinforcement learning with heavy-tailed reward distributions, where frequent outliers can cause existing methods to fail. It proposes a robust temporal difference learning method with dynamic gradient clipping that achieves sample complexities of order O(ε^{-1/p}) and O(ε^{-1-1/p}) with and without the full-rank assumption on the feature matrix, respectively, when the rewards have finite moments of order (1+p) for some p in (0,1].

In a broad class of reinforcement learning applications, stochastic rewards have heavy-tailed distributions, which lead to infinite second-order moments for stochastic (semi)gradients in policy evaluation and direct policy optimization. In such instances, the existing RL methods may fail miserably due to frequent statistical outliers. In this work, we establish that temporal difference (TD) learning with a dynamic gradient clipping mechanism, and correspondingly operated natural actor-critic (NAC), can be provably robustified against heavy-tailed reward distributions. It is shown in the framework of linear function approximation that a favorable tradeoff between bias and variability of the stochastic gradients can be achieved with this dynamic gradient clipping mechanism. In particular, we prove that robust versions of TD learning achieve sample complexities of order $\mathcal{O}(\varepsilon^{-\frac{1}{p}})$ and $\mathcal{O}(\varepsilon^{-1-\frac{1}{p}})$ with and without the full-rank assumption on the feature matrix, respectively, under heavy-tailed rewards with finite moments of order $(1+p)$ for some $p\in(0,1]$, both in expectation and with high probability. We show that a robust variant of NAC based on Robust TD learning achieves $\tilde{\mathcal{O}}(\varepsilon^{-4-\frac{2}{p}})$ sample complexity. We corroborate our theoretical results with numerical experiments.
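The clipping idea in the abstract can be illustrated with a minimal sketch of TD(0) under linear function approximation, where each stochastic semi-gradient is rescaled to a norm bound that grows with the iteration count. This is not the paper's algorithm as analysed: the schedule `b0 * t**growth`, the constant step size, the two-state chain, and the function names are all hypothetical choices made for illustration, and any projection step is omitted. The reward noise is centred Pareto with tail index 1.5, so it has a finite mean but infinite variance.

```python
import math
import random


def clip_by_norm(g, b):
    """Scale g so its Euclidean norm is at most b (unchanged if already within b)."""
    norm = math.sqrt(sum(x * x for x in g))
    if norm <= b or norm == 0.0:
        return list(g)
    return [x * b / norm for x in g]


def robust_td0(sample_transition, phi, dim, gamma=0.9, steps=5000,
               step_size=0.05, b0=1.0, growth=0.5, seed=0):
    """TD(0) with linear features and a clipping threshold that grows with t.

    The schedule b_t = b0 * t**growth is an illustrative assumption,
    not the schedule analysed in the paper.
    """
    rng = random.Random(seed)
    theta = [0.0] * dim
    s = 0
    for t in range(1, steps + 1):
        s_next, r = sample_transition(s, rng)
        f, f_next = phi(s), phi(s_next)
        v = sum(a * b for a, b in zip(theta, f))
        v_next = sum(a * b for a, b in zip(theta, f_next))
        delta = r + gamma * v_next - v          # TD error
        g = [delta * x for x in f]              # stochastic semi-gradient
        g = clip_by_norm(g, b0 * t ** growth)   # dynamic clipping
        theta = [a + step_size * x for a, x in zip(theta, g)]
        s = s_next
    return theta


def two_state_chain(s, rng):
    """Hypothetical 2-state chain with centred Pareto(1.5) reward noise."""
    s_next = rng.randrange(2)
    noise = rng.paretovariate(1.5) - 3.0  # Pareto(1.5), scale 1, has mean 3
    return s_next, float(s) + noise


def one_hot(s):
    """Tabular features: a full-rank feature matrix for the 2-state chain."""
    return [1.0, 0.0] if s == 0 else [0.0, 1.0]


theta = robust_td0(two_state_chain, one_hot, dim=2)
print(theta)
```

A small threshold early on suppresses outliers at the cost of bias, while letting the threshold grow reduces that bias over time; this is the bias and variability tradeoff the abstract refers to, shown here only qualitatively.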

