LGMLFeb 25, 2021

Improved Regret Bound and Experience Replay in Regularized Policy Iteration

arXiv:2102.12611v120 citations
Originality Highly original
AI Analysis

This work addresses computational inefficiencies in policy iteration algorithms for reinforcement learning, offering a novel theoretical justification for experience replay.

The authors improved the regret bound of the Politex algorithm from O(T^{3/4}) to O(√T) in infinite-horizon undiscounted MDPs with function approximation, and proposed an efficient implementation using experience replay that enhances performance in wall-clock time.

In this work, we study algorithms for learning in infinite-horizon undiscounted Markov decision processes (MDPs) with function approximation. We first show that the regret analysis of the Politex algorithm (a version of regularized policy iteration) can be sharpened from $O(T^{3/4})$ to $O(\sqrt{T})$ under nearly identical assumptions, and instantiate the bound with linear function approximation. Our result provides the first high-probability $O(\sqrt{T})$ regret bound for a computationally efficient algorithm in this setting. The exact implementation of Politex with neural network function approximation is inefficient in terms of memory and computation. Since our analysis suggests that we need to approximate the average of the action-value functions of past policies well, we propose a simple efficient implementation where we train a single Q-function on a replay buffer with past data. We show that this often leads to superior performance over other implementation choices, especially in terms of wall-clock time. Our work also provides a novel theoretical justification for using experience replay within policy iteration algorithms.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes