LG · Aug 19, 2021

Provable Benefits of Actor-Critic Methods for Offline Reinforcement Learning

arXiv:2108.08812v1 · 137 citations
Originality: Incremental advance
AI Analysis

This work narrows the gap between the practical use of actor-critic methods in offline RL and their theoretical understanding. The advance is incremental, since it builds on existing actor-critic frameworks.

The paper tackles the theoretical understanding of actor-critic methods in offline reinforcement learning by proposing a new algorithm that incorporates pessimism. The algorithm operates in a setting more general than low-rank MDPs, and the paper proves an upper bound on the suboptimality gap together with a minimax lower bound that matches up to logarithmic factors.

Actor-critic methods are widely used in offline reinforcement learning practice, but are not so well understood theoretically. We propose a new offline actor-critic algorithm that naturally incorporates the pessimism principle, leading to several key advantages compared to the state of the art. The algorithm can operate when the Bellman evaluation operator is closed with respect to the action value function of the actor's policies; this is a more general setting than the low-rank MDP model. Despite the added generality, the procedure is computationally tractable as it involves the solution of a sequence of second-order programs. We prove an upper bound on the suboptimality gap of the policy returned by the procedure that depends on the data coverage of any arbitrary, possibly data-dependent comparator policy. The achievable guarantee is complemented with a minimax lower bound that is matching up to logarithmic factors.
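To make the pessimism principle concrete, here is a minimal sketch of a pessimistic actor-critic loop. It is not the paper's algorithm. It assumes a one-step problem (a multi-armed bandit) with one-hot action features, so the critic's constrained minimization over an ellipsoidal confidence set has a closed form and no solver is needed. All names (`pessimistic_critic`, `offline_actor_critic`, `alpha`, `lam`, `eta`) and the toy dataset are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    w = [math.exp(l - m) for l in logits]
    z = sum(w)
    return [x / z for x in w]


def pessimistic_critic(pi, counts, mean_reward, alpha, lam=1.0):
    """Most pessimistic action values for policy `pi` (illustrative assumption).

    With one-hot features, the ridge estimate is
        w_hat[a] = counts[a] * mean_reward[a] / (counts[a] + lam).
    The critic adds a perturbation xi that minimizes sum_a pi[a] * xi[a]
    subject to sum_a (counts[a] + lam) * xi[a]**2 <= alpha**2, which gives
        xi[a] = -alpha * pi[a] / (counts[a] + lam) / norm,
    where norm = sqrt(sum_a pi[a]**2 / (counts[a] + lam)).
    """
    inv = [1.0 / (n + lam) for n in counts]
    w_hat = [n * m * s for n, m, s in zip(counts, mean_reward, inv)]
    norm = math.sqrt(sum(p * p * s for p, s in zip(pi, inv)))
    xi = [-alpha * p * s / norm for p, s in zip(pi, inv)]
    return [w + x for w, x in zip(w_hat, xi)]


def offline_actor_critic(counts, mean_reward, alpha=1.0, eta=0.5, steps=200):
    """Alternate a pessimistic critic with a softmax (mirror-descent) actor."""
    logits = [0.0] * len(counts)
    pi = softmax(logits)
    for _ in range(steps):
        q = pessimistic_critic(pi, counts, mean_reward, alpha)
        # Actor step: exponentiated-gradient update on the pessimistic values.
        logits = [l + eta * v for l, v in zip(logits, q)]
        pi = softmax(logits)
    return pi


if __name__ == "__main__":
    # Toy offline data: action 0 is well covered, action 1 was tried once
    # and happens to have the higher empirical mean reward.
    counts = [100, 1]
    mean_reward = [0.5, 0.9]
    print(offline_actor_critic(counts, mean_reward))
```

In this toy run, a greedy choice on raw empirical means would pick the poorly covered action, while the pessimistic loop ends up placing most of its probability on the well-covered one. The penalty grows with how much mass the policy puts on rarely observed actions, which is the sense in which the guarantee depends on data coverage.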

