LGOct 27, 2021

Reinforcement Learning in Linear MDPs: Constant Regret and Representation Selection

arXiv:2110.14798v125 citations
Originality Highly original
AI Analysis

This work addresses the challenge of regret minimization in reinforcement learning for linear MDPs, which is incremental as it builds on existing settings like low-rank MDPs and Bellman closure.

The paper tackles the problem of achieving constant regret in linear MDPs by identifying a necessary and sufficient condition (UNISOFT) for representation of value functions, and demonstrates that optimistic algorithms like LSVI-UCB and ELEANOR achieve constant regret under this condition, with a proposed algorithm for representation selection also achieving constant regret.

We study the role of the representation of state-action value functions in regret minimization in finite-horizon Markov Decision Processes (MDPs) with linear structure. We first derive a necessary condition on the representation, called universally spanning optimal features (UNISOFT), to achieve constant regret in any MDP with linear reward function. This result encompasses the well-known settings of low-rank MDPs and, more generally, zero inherent Bellman error (also known as the Bellman closure assumption). We then demonstrate that this condition is also sufficient for these classes of problems by deriving a constant regret bound for two optimistic algorithms (LSVI-UCB and ELEANOR). Finally, we propose an algorithm for representation selection and we prove that it achieves constant regret when one of the given representations, or a suitable combination of them, satisfies the UNISOFT condition.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes