LGJun 23, 2022

Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation

arXiv:2206.11489v338 citationsh-index: 6
Originality Highly original
AI Analysis

This provides a nearly optimal solution for reinforcement learning in linear MDPs, addressing a key theoretical bottleneck in the field.

The paper tackles reinforcement learning with linear function approximation in episodic inhomogeneous linear MDPs, proposing the LSVI-UCB+ algorithm that achieves an $\widetilde{O}(Hd\sqrt{T})$ regret bound, which is nearly minimax optimal up to logarithmic factors and closes a $\sqrt{Hd}$ gap from prior work.

We study reinforcement learning with linear function approximation where the transition probability and reward functions are linear with respect to a feature mapping $\boldsymbolφ(s,a)$. Specifically, we consider the episodic inhomogeneous linear Markov Decision Process (MDP), and propose a novel computation-efficient algorithm, LSVI-UCB$^+$, which achieves an $\widetilde{O}(Hd\sqrt{T})$ regret bound where $H$ is the episode length, $d$ is the feature dimension, and $T$ is the number of steps. LSVI-UCB$^+$ builds on weighted ridge regression and upper confidence value iteration with a Bernstein-type exploration bonus. Our statistical results are obtained with novel analytical tools, including a new Bernstein self-normalized bound with conservatism on elliptical potentials, and refined analysis of the correction term. This is a minimax optimal algorithm for linear MDPs up to logarithmic factors, which closes the $\sqrt{Hd}$ gap between the upper bound of $\widetilde{O}(\sqrt{H^3d^3T})$ in (Jin et al., 2020) and lower bound of $Ω(Hd\sqrt{T})$ for linear MDPs.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes