LGMar 15, 2024

Horizon-Free Regret for Linear Markov Decision Processes

arXiv:2403.10738v16 citationsh-index: 4ICLR
Originality Highly original
AI Analysis

This work addresses a key limitation in RL theory for settings with large state spaces, providing a theoretical advance for researchers in machine learning and AI.

The paper tackles the problem of achieving horizon-free regret bounds in reinforcement learning for linear Markov Decision Processes (MDPs), where the transition model can be exponentially large or uncountable, and it presents the first such bound by directly estimating value functions with confidence sets, resulting in a regret bound independent of the planning horizon.

A recent line of works showed regret bounds in reinforcement learning (RL) can be (nearly) independent of planning horizon, a.k.a.~the horizon-free bounds. However, these regret bounds only apply to settings where a polynomial dependency on the size of transition model is allowed, such as tabular Markov Decision Process (MDP) and linear mixture MDP. We give the first horizon-free bound for the popular linear MDP setting where the size of the transition model can be exponentially large or even uncountable. In contrast to prior works which explicitly estimate the transition model and compute the inhomogeneous value functions at different time steps, we directly estimate the value functions and confidence sets. We obtain the horizon-free bound by: (1) maintaining multiple weighted least square estimators for the value functions; and (2) a structural lemma which shows the maximal total variation of the inhomogeneous value functions is bounded by a polynomial factor of the feature dimension.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes