LG OC ML | Dec 15, 2020

Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes

Dongruo Zhou, Quanquan Gu, Csaba Szepesvari

arXiv:2012.08507v232.6235 citations

Originality: Highly original

AI Analysis

This work presents what the authors describe as the first computationally efficient, nearly minimax optimal algorithms for reinforcement learning with linear function approximation. For linear mixture MDPs, the regret upper bounds match the known lower bounds up to logarithmic factors.

This paper addresses reinforcement learning in linear mixture Markov decision processes and proposes two new algorithms: UCRL-VTR+ for the episodic undiscounted setting and UCLK+ for the discounted setting. UCRL-VTR+ achieves $\tilde{O}(dH\sqrt{T})$ regret and UCLK+ achieves $\tilde{O}(d\sqrt{T}/(1-\gamma)^{1.5})$ regret. Each bound matches the corresponding lower bound up to logarithmic factors, so both algorithms are nearly minimax optimal.

We study reinforcement learning (RL) with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model (Jia et al., 2020; Ayoub et al., 2020; Zhou et al., 2020) and the learning agent has access to either an integration or a sampling oracle of the individual basis kernels. We propose a new Bernstein-type concentration inequality for self-normalized martingales for linear bandit problems with bounded noise. Based on the new inequality, we propose a new, computationally efficient algorithm with linear function approximation named $\text{UCRL-VTR}^{+}$ for the aforementioned linear mixture MDPs in the episodic undiscounted setting. We show that $\text{UCRL-VTR}^{+}$ attains an $\tilde{O}(dH\sqrt{T})$ regret where $d$ is the dimension of feature mapping, $H$ is the length of the episode and $T$ is the number of interactions with the MDP. We also prove a matching lower bound $\Omega(dH\sqrt{T})$ for this setting, which shows that $\text{UCRL-VTR}^{+}$ is minimax optimal up to logarithmic factors. In addition, we propose the $\text{UCLK}^{+}$ algorithm for the same family of MDPs under discounting and show that it attains an $\tilde{O}(d\sqrt{T}/(1-\gamma)^{1.5})$ regret, where $\gamma\in [0,1)$ is the discount factor. Our upper bound matches the lower bound $\Omega(d\sqrt{T}/(1-\gamma)^{1.5})$ proved by Zhou et al. (2020) up to logarithmic factors, suggesting that $\text{UCLK}^{+}$ is nearly minimax optimal. To the best of our knowledge, these are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.
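To make the setting in the abstract concrete, the following is a minimal Python sketch of a tabular linear mixture MDP and of a variance-weighted ridge regression on value targets. It is an illustration under stated assumptions, not the authors' UCRL-VTR+ implementation: the function names (`make_linear_mixture`, `integrated_feature`, `weighted_ridge`, `simulate`), the parameter `var_floor`, and the choice to draw the mixing vector from the simplex are all hypothetical. The sketch also weights each sample by the true conditional variance of the target, whereas an actual algorithm would have to estimate that variance from data.

```python
import numpy as np


def make_linear_mixture(num_states, num_actions, d, rng):
    """Return d basis kernels P_i(s'|s,a) and a mixing vector theta.

    Assumption for this sketch: theta lies on the probability simplex, so
    P_theta = sum_i theta_i * P_i is automatically a valid transition kernel.
    """
    basis = rng.random((d, num_states, num_actions, num_states))
    basis /= basis.sum(axis=-1, keepdims=True)  # normalize over next states
    theta = rng.dirichlet(np.ones(d))
    return basis, theta


def integrated_feature(basis, value, s, a):
    """phi_V(s,a)[i] = sum_{s'} P_i(s'|s,a) * V(s'), playing the role of an integration oracle."""
    return basis[:, s, a, :] @ value


def weighted_ridge(features, targets, weights, lam=1.0):
    """Minimize sum_k w_k * (<x_k, theta> - y_k)^2 + lam * ||theta||^2."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(targets, dtype=float)
    w = np.asarray(weights, dtype=float)
    A = lam * np.eye(X.shape[1]) + (X * w[:, None]).T @ X
    b = (X * w[:, None]).T @ y
    return np.linalg.solve(A, b)


def simulate(num_samples=5000, num_states=5, num_actions=2, d=3,
             var_floor=1e-2, seed=0):
    """Estimate theta from sampled transitions for one fixed value function V."""
    rng = np.random.default_rng(seed)
    basis, theta = make_linear_mixture(num_states, num_actions, d, rng)
    value = rng.random(num_states)  # fixed V with entries in [0, 1)
    feats, targets, weights = [], [], []
    for _ in range(num_samples):
        s = rng.integers(num_states)
        a = rng.integers(num_actions)
        p_next = theta @ basis[:, s, a, :]  # true P_theta(.|s,a)
        s_next = rng.choice(num_states, p=p_next)
        # Oracle variance of V(s'); a stand-in for a learned variance estimate.
        var = p_next @ value**2 - (p_next @ value) ** 2
        feats.append(integrated_feature(basis, value, s, a))
        targets.append(value[s_next])  # regression target V(s')
        weights.append(1.0 / max(var, var_floor))
    theta_hat = weighted_ridge(feats, targets, weights)
    return theta, theta_hat


if __name__ == "__main__":
    theta, theta_hat = simulate()
    print("true theta:     ", theta)
    print("estimated theta:", theta_hat)
```

The regression is linear because the expected target satisfies $\mathbb{E}[V(s') \mid s, a] = \langle \phi_V(s,a), \theta \rangle$, so the unknown kernel enters only through the $d$-dimensional vector $\theta$. Down-weighting high-variance samples is the intuition behind using a Bernstein-type rather than a Hoeffding-type confidence bound.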
