LG AI OC ML · Mar 17, 2021

Infinite-Horizon Offline Reinforcement Learning with Linear Function Approximation: Curse of Dimensionality and Algorithm

Lin Chen, Bruno Scherrer, Peter L. Bartlett

arXiv:2103.09847v1 · 14.617 citations · h-index: 68

Originality: Incremental advance

AI Analysis

The paper addresses a fundamental theoretical problem in offline RL, providing both a sample-complexity lower bound and an algorithmic guarantee, though its contribution is an incremental refinement of existing analysis.

The paper tackles the sample complexity of policy evaluation in infinite-horizon offline reinforcement learning with linear function approximation, identifying a hard regime where the lower bound is exponential in dimension, and proposes an algorithm with polynomial sample complexity under low distribution shift assumptions.

In this paper, we investigate the sample complexity of policy evaluation in infinite-horizon offline reinforcement learning (also known as the off-policy evaluation problem) with linear function approximation. We identify a hard regime $d\gamma^{2}>1$, where $d$ is the dimension of the feature vector and $\gamma$ is the discount rate. In this regime, for any $q\in[\gamma^{2},1]$, we can construct a hard instance such that the smallest eigenvalue of its feature covariance matrix is $q/d$ and it requires $\Omega\left(\frac{d}{\gamma^{2}\left(q-\gamma^{2}\right)\varepsilon^{2}}\exp\left(\Theta\left(d\gamma^{2}\right)\right)\right)$ samples to approximate the value function up to an additive error $\varepsilon$. Note that the lower bound of the sample complexity is exponential in $d$. If $q=\gamma^{2}$, even infinite data cannot suffice. Under the low distribution shift assumption, we show that there is an algorithm that needs at most $O\left(\max\left\{ \frac{\left\Vert \theta^{\pi}\right\Vert _{2}^{4}}{\varepsilon^{4}}\log\frac{d}{\delta},\frac{1}{\varepsilon^{2}}\left(d+\log\frac{1}{\delta}\right)\right\} \right)$ samples ($\theta^{\pi}$ is the parameter of the policy in linear function approximation) and guarantees approximation to the value function up to an additive error of $\varepsilon$ with probability at least $1-\delta$.
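
To get a feel for how the two bounds in the abstract scale, the following is a minimal Python sketch that evaluates only their functional shape. It is not taken from the paper: the function names are hypothetical, and every constant hidden by the $\Omega$, $\Theta$, and $O$ notation is assumed to be 1 (the exponent constant is exposed as the assumed parameter `c`), so the returned numbers illustrate growth rates and are not actual sample counts.

```python
import math


def in_hard_regime(d: int, gamma: float) -> bool:
    """Check the hard-regime condition d * gamma^2 > 1 from the abstract."""
    return d * gamma**2 > 1


def lower_bound_shape(d: int, gamma: float, q: float, eps: float, c: float = 1.0) -> float:
    """Shape of the lower bound d / (gamma^2 (q - gamma^2) eps^2) * exp(c d gamma^2).

    ASSUMPTION: the constants hidden by Omega(.) and Theta(.) are unknown
    here; the leading constant is taken as 1 and the exponent constant is `c`.
    """
    if not in_hard_regime(d, gamma):
        raise ValueError("the bound is stated only for d * gamma^2 > 1")
    if not (gamma**2 <= q <= 1):
        raise ValueError("q must lie in [gamma^2, 1]")
    if q == gamma**2:
        # The abstract states that even infinite data cannot suffice here.
        return math.inf
    return d / (gamma**2 * (q - gamma**2) * eps**2) * math.exp(c * d * gamma**2)


def upper_bound_shape(d: int, theta_norm: float, eps: float, delta: float) -> float:
    """Shape of the upper bound under the low distribution shift assumption.

    ASSUMPTION: the constant hidden by O(.) is taken as 1.
    """
    term_norm = theta_norm**4 / eps**4 * math.log(d / delta)
    term_dim = (d + math.log(1 / delta)) / eps**2
    return max(term_norm, term_dim)


if __name__ == "__main__":
    gamma, eps, delta = 0.9, 0.1, 0.05
    for d in (10, 20, 40):
        lower = lower_bound_shape(d, gamma, q=1.0, eps=eps)
        upper = upper_bound_shape(d, theta_norm=1.0, eps=eps, delta=delta)
        print(f"d={d:3d}  lower-bound shape={lower:.3e}  upper-bound shape={upper:.3e}")
```

Under these assumptions, the lower-bound shape grows exponentially as $d$ increases, while the upper-bound shape grows only logarithmically or linearly in $d$, depending on which term of the maximum dominates. The two bounds are stated under different assumptions, so comparing their values for the same inputs only contrasts the scaling.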
