LG AI MLFeb 2, 2021

Near-Optimal Offline Reinforcement Learning via Double Variance Reduction

arXiv:2102.01748v125.471 citations

Originality Highly original

AI Analysis

This work provides a near-optimal algorithm for offline reinforcement learning in tabular MDPs, which is significant for researchers and practitioners aiming to optimize policies using only historical data.

This paper addresses offline reinforcement learning (RL) in tabular Markov Decision Processes (MDPs) and proposes Off-Policy Double Variance Reduction (OPDVR). The algorithm provably identifies an ε-optimal policy with Õ(H^2/d_mε^2) episodes of offline data, improving the best known upper bound by a factor of H and matching the information-theoretic lower bound.

We consider the problem of offline reinforcement learning (RL) -- a well-motivated setting of RL that aims at policy optimization using only historical data. Despite its wide applicability, theoretical understandings of offline RL, such as its optimal sample complexity, remain largely open even in basic settings such as \emph{tabular} Markov Decision Processes (MDPs). In this paper, we propose Off-Policy Double Variance Reduction (OPDVR), a new variance reduction based algorithm for offline RL. Our main result shows that OPDVR provably identifies an $ε$-optimal policy with $\widetilde{O}(H^2/d_mε^2)$ episodes of offline data in the finite-horizon stationary transition setting, where $H$ is the horizon length and $d_m$ is the minimal marginal state-action distribution induced by the behavior policy. This improves over the best known upper bound by a factor of $H$. Moreover, we establish an information-theoretic lower bound of $Ω(H^2/d_mε^2)$ which certifies that OPDVR is optimal up to logarithmic factors. Lastly, we show that OPDVR also achieves rate-optimal sample complexity under alternative settings such as the finite-horizon MDPs with non-stationary transitions and the infinite horizon MDPs with discounted rewards.

View on arXiv PDF

Similar