LG · AI · ML · Jun 13, 2021

Bellman-consistent Pessimism for Offline Reinforcement Learning

arXiv:2106.06926v6 · 330 citations
Originality: Highly original
AI Analysis

For offline RL researchers and practitioners, this work addresses how overly pessimistic reasoning can prevent the discovery of good policies. It offers a method that adapts to the bias-variance tradeoff without tuning extra hyperparameters a priori.

The paper tackles overly pessimistic reasoning in offline reinforcement learning by introducing Bellman-consistent pessimism: rather than computing a point-wise lower bound on the value function, it applies pessimism at the initial state over the set of functions consistent with the Bellman equations. In linear function approximation with a finite action space, this improves on a recent bonus-based approach by $\mathcal{O}(d)$ in sample complexity, and the algorithms adapt automatically to the best bias-variance tradeoff in hindsight.

The use of pessimism when reasoning about datasets lacking exhaustive exploration has recently gained prominence in offline reinforcement learning. Despite the robustness it adds to the algorithm, overly pessimistic reasoning can be equally damaging in precluding the discovery of good policies, which is an issue for the popular bonus-based pessimism. In this paper, we introduce the notion of Bellman-consistent pessimism for general function approximation: instead of calculating a point-wise lower bound for the value function, we implement pessimism at the initial state over the set of functions consistent with the Bellman equations. Our theoretical guarantees only require Bellman closedness as standard in the exploratory setting, in which case bonus-based pessimism fails to provide guarantees. Even in the special case of linear function approximation where stronger expressivity assumptions hold, our result improves upon a recent bonus-based approach by $\mathcal{O}(d)$ in its sample complexity when the action space is finite. Remarkably, our algorithms automatically adapt to the best bias-variance tradeoff in hindsight, whereas most prior approaches require tuning extra hyperparameters a priori.
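The idea of being pessimistic only at the initial state, over functions consistent with the Bellman equations, can be illustrated with a small toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the two-state problem, the finite class of Q-tables `F`, the squared-loss consistency test, the threshold `EPS`, and all function names. It is not the authors' algorithm or implementation.

```python
from itertools import product

GAMMA = 1.0   # assumed discount for this toy problem
EPS = 0.01    # assumed consistency threshold (hypothetical value)

# Logged transitions (state, action, reward, next_state); None marks termination.
# Only action 0 is ever observed, so action 1 is unexplored in both states.
DATA = [(0, 0, 0.0, 1), (1, 0, 1.0, None)]

# Finite candidate class: every Q-table f[s][a] with entries in {0, 1}.
F = [((a, b), (c, d)) for a, b, c, d in product([0.0, 1.0], repeat=4)]

# Deterministic policies written as (action in state 0, action in state 1).
POLICIES = list(product([0, 1], repeat=2))


def backup_loss(g, f, pi):
    """Mean squared error of g against the one-step backup of f under pi."""
    total = 0.0
    for s, a, r, s_next in DATA:
        next_value = 0.0 if s_next is None else f[s_next][pi[s_next]]
        total += (g[s][a] - r - GAMMA * next_value) ** 2
    return total / len(DATA)


def bellman_error(f, pi):
    """Loss of f on its own backup, minus the best loss achievable in F."""
    return backup_loss(f, f, pi) - min(backup_loss(g, f, pi) for g in F)


def pessimistic_value(pi, s0=0):
    """Smallest initial-state value among Q-tables consistent with the data."""
    consistent = [f for f in F if bellman_error(f, pi) <= EPS]
    return min(f[s0][pi[s0]] for f in consistent)


# Pick the policy whose worst-case initial-state value is largest.
best_policy = max(POLICIES, key=pessimistic_value)
print(best_policy, pessimistic_value(best_policy))
```

In this toy, any policy that uses the unobserved action 1 admits a data-consistent Q-table assigning it value 0 at the initial state, so the selected policy stays on the observed actions. No per-state bonus or point-wise lower bound is computed: the pessimism enters only through the minimum over the consistent set.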

