LG | AI | Dec 4, 2023

When is Offline Policy Selection Sample Efficient for Reinforcement Learning?

arXiv:2312.02355v1 | 3 citations | h-index: 6
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of efficiently selecting among candidate policies in offline RL, a step that is crucial for deployment but whose fundamental limits were previously poorly understood. The advance is incremental: it mainly connects offline policy selection (OPS) to existing off-policy policy evaluation (OPE) methods.

The paper tackles offline policy selection (OPS) in reinforcement learning. It shows that, in the worst case, OPS is as hard as off-policy policy evaluation (OPE), so no OPS method can be more sample efficient than OPE. It then proposes a Bellman error method, Identifiable BE Selection (IBES), that can be more sample efficient when its additional requirements are satisfied. The paper closes with an empirical comparison of OPE and IBES and a demonstration of how difficult OPS is on an offline Atari benchmark dataset.

Offline reinforcement learning algorithms often require careful hyperparameter tuning. Consequently, before deployment, we need to select amongst a set of candidate policies. As yet, however, there is little understanding about the fundamental limits of this offline policy selection (OPS) problem. In this work we aim to provide clarity on when sample efficient OPS is possible, primarily by connecting OPS to off-policy policy evaluation (OPE) and Bellman error (BE) estimation. We first show a hardness result, that in the worst case, OPS is just as hard as OPE, by proving a reduction of OPE to OPS. As a result, no OPS method can be more sample efficient than OPE in the worst case. We then propose a BE method for OPS, called Identifiable BE Selection (IBES), that has a straightforward method for selecting its own hyperparameters. We highlight that using IBES for OPS generally has more requirements than OPE methods, but if satisfied, can be more sample efficient. We conclude with an empirical study comparing OPE and IBES, and by showing the difficulty of OPS on an offline Atari benchmark dataset.
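
To make the OPE-based route to OPS concrete, the sketch below scores each candidate policy with an ordinary (trajectory-wise) importance-sampling estimate of its expected return and picks the highest-scoring one. It is an illustration under stated assumptions, not the paper's method and not IBES: the tabular policy representation, the known behavior policy, and the function names `is_value_estimate` and `select_policy` are all hypothetical choices made for this example.

```python
import numpy as np


def is_value_estimate(trajectories, target_policy, behavior_policy, gamma=0.99):
    """Ordinary importance-sampling estimate of a policy's expected return.

    trajectories: list of episodes, each a list of (state, action, reward)
        tuples collected by the behavior policy (assumed known here).
    target_policy, behavior_policy: arrays of shape (n_states, n_actions)
        holding action probabilities.
    """
    estimates = []
    for episode in trajectories:
        weight, ret, discount = 1.0, 0.0, 1.0
        for state, action, reward in episode:
            # Cumulative likelihood ratio of the episode under the two policies.
            weight *= target_policy[state, action] / behavior_policy[state, action]
            ret += discount * reward
            discount *= gamma
        estimates.append(weight * ret)
    return float(np.mean(estimates))


def select_policy(trajectories, candidates, behavior_policy, gamma=0.99):
    """Return the index of the candidate with the highest estimated value."""
    values = [
        is_value_estimate(trajectories, pi, behavior_policy, gamma)
        for pi in candidates
    ]
    return int(np.argmax(values)), values


if __name__ == "__main__":
    # Toy one-state, two-action problem: action 0 pays 1, action 1 pays 0.
    behavior = np.array([[0.5, 0.5]])
    data = [[(0, 0, 1.0)], [(0, 1, 0.0)]]
    always_0 = np.array([[1.0, 0.0]])
    always_1 = np.array([[0.0, 1.0]])
    best, values = select_policy(data, [always_0, always_1], behavior)
    print(best, values)  # prints: 0 [1.0, 0.0]
```

Importance sampling is used here only because it is short and self-contained. Its variance grows quickly with the horizon, which is one way to see why sample-efficient OPS is hard when it is reduced to OPE.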

