LG MLDec 12, 2022

VO$Q$L: Towards Optimal Regret in Model-free RL with Nonlinear Function Approximation

arXiv:2212.06069v124.957 citationsh-index: 60

Originality Highly original

AI Analysis

This provides a computationally efficient and statistically optimal solution for reinforcement learning in linear MDPs, addressing a key bottleneck in the field.

The paper tackles the problem of model-free reinforcement learning with nonlinear function approximation and sparse rewards by proposing a new algorithm, VOQL, which achieves asymptotically optimal regret, specifically $ ilde{O}(d\sqrt{HT}+d^6H^{5})$ for linear MDPs.

We study time-inhomogeneous episodic reinforcement learning (RL) under general function approximation and sparse rewards. We design a new algorithm, Variance-weighted Optimistic $Q$-Learning (VO$Q$L), based on $Q$-learning and bound its regret assuming completeness and bounded Eluder dimension for the regression function class. As a special case, VO$Q$L achieves $\tilde{O}(d\sqrt{HT}+d^6H^{5})$ regret over $T$ episodes for a horizon $H$ MDP under ($d$-dimensional) linear function approximation, which is asymptotically optimal. Our algorithm incorporates weighted regression-based upper and lower bounds on the optimal value function to obtain this improved regret. The algorithm is computationally efficient given a regression oracle over the function class, making this the first computationally tractable and statistically optimal approach for linear MDPs.

View on arXiv PDF

Similar