LG · AI · May 15, 2021

Regret Minimization Experience Replay in Off-Policy Reinforcement Learning

arXiv:2105.07253v3 · 48 citations
Originality: Incremental advance
AI Analysis

This work improves sample efficiency in reinforcement learning, addressing a key bottleneck for researchers and practitioners, though it is incremental as it builds on existing prioritization techniques.

The authors tackled the problem of suboptimal sample prioritization in off-policy reinforcement learning by deriving an optimal strategy from regret minimization, which outperformed previous methods on benchmarks like MuJoCo, Atari, and Meta-World.

In reinforcement learning, experience replay stores past samples for reuse. Prioritized sampling is a promising technique to better utilize these samples. Previous prioritization criteria include TD error, recency and corrective feedback, which are mostly heuristically designed. In this work, we start from the regret minimization objective and obtain an optimal prioritization strategy for the Bellman update that can directly maximize the return of the policy. The theory suggests that data with higher hindsight TD error, better on-policiness and more accurate Q value should be assigned higher weights during sampling. Most previous criteria therefore capture this strategy only partially. We not only provide theoretical justifications for previous criteria, but also propose two new methods to compute the prioritization weight, namely ReMERN and ReMERT. ReMERN learns an error network, while ReMERT exploits the temporal ordering of states. Both methods outperform previous prioritized sampling algorithms in challenging RL benchmarks, including MuJoCo, Atari and Meta-World.
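To make the abstract's three criteria concrete, here is a minimal sketch of weighted replay sampling. It is not the paper's formula, which the abstract does not state: the function names `priority_weights` and `sample_batch`, the multiplicative combination, and the `1 / (1 + |q_error|)` discount are all assumptions chosen for illustration. It also assumes that the three per-sample scores are already available; estimating them (for example with an error network as in ReMERN, or from temporal ordering as in ReMERT) is not shown.

```python
import random


def priority_weights(td_errors, on_policy_scores, q_errors, eps=1e-6):
    """Return normalized sampling weights for a replay buffer.

    Hypothetical combination (an assumption, not the paper's formula):
    the weight grows with the magnitude of the TD error and with the
    on-policiness score, and shrinks as the Q-value error grows.
    """
    raw = [
        (abs(td) + eps) * max(op, 0.0) / (1.0 + abs(qe))
        for td, op, qe in zip(td_errors, on_policy_scores, q_errors)
    ]
    total = sum(raw)
    if total <= 0.0:
        # Degenerate case: fall back to uniform sampling.
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]


def sample_batch(buffer, weights, batch_size, rng=None):
    """Draw transitions with replacement, proportionally to their weights."""
    rng = rng or random.Random()
    return rng.choices(buffer, weights=weights, k=batch_size)


if __name__ == "__main__":
    # Toy buffer of three transitions, identified by name only.
    buffer = ["t0", "t1", "t2"]
    weights = priority_weights(
        td_errors=[0.1, 1.0, 1.0],
        on_policy_scores=[1.0, 1.0, 0.2],
        q_errors=[0.0, 0.0, 0.0],
    )
    print(weights)
    print(sample_batch(buffer, weights, batch_size=4, rng=random.Random(0)))
```

In this toy setup, the transition with a large TD error and a high on-policiness score (`t1`) receives the largest weight, which matches the qualitative ordering the abstract describes.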

Code Implementations: 1 repo
