LG ML Sep 12, 2021

Improved Algorithms for Misspecified Linear Markov Decision Processes

Daniel Vial, Advait Parulekar, Sanjay Shakkottai, R. Srikant

arXiv:2109.05546v28.48 citations

Originality: Highly original

AI Analysis

This paper addresses robust decision-making in reinforcement learning under model misspecification, offering a new algorithm whose space and per-episode time costs stay bounded as the number of episodes grows.

The paper tackles misspecified linear Markov decision processes (MLMDPs) by proposing an algorithm whose regret after K episodes scales as K max{ε_mis, ε_tol}, whose space and per-episode time complexities remain bounded as K grows, and which does not need ε_mis as input. For specific choices of ε_tol, it also improves existing regret bounds (up to log factors).

For the misspecified linear Markov decision process (MLMDP) model of Jin et al. [2020], we propose an algorithm with three desirable properties. (P1) Its regret after $K$ episodes scales as $K \max \{ \varepsilon_{\text{mis}}, \varepsilon_{\text{tol}} \}$, where $\varepsilon_{\text{mis}}$ is the degree of misspecification and $\varepsilon_{\text{tol}}$ is a user-specified error tolerance. (P2) Its space and per-episode time complexities remain bounded as $K \rightarrow \infty$. (P3) It does not require $\varepsilon_{\text{mis}}$ as input. To our knowledge, this is the first algorithm satisfying all three properties. For concrete choices of $\varepsilon_{\text{tol}}$, we also improve existing regret bounds (up to log factors) while achieving either (P2) or (P3) (existing algorithms satisfy neither). At a high level, our algorithm generalizes (to MLMDPs) and refines the Sup-Lin-UCB algorithm, which Takemura et al. [2021] recently showed satisfies (P3) for contextual bandits. We also provide an intuitive interpretation of their result, which informs the design of our algorithm.
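To make the scaling in (P1) concrete, here is a minimal Python sketch of the leading term $K \max \{ \varepsilon_{\text{mis}}, \varepsilon_{\text{tol}} \}$ only. It is an illustration under assumptions, not the paper's bound: the function name `regret_scale` and the numeric values are hypothetical, and every other factor the full bound may carry (dimension, horizon, logarithmic and lower-order terms) is omitted because the abstract does not state it.

```python
def regret_scale(num_episodes: int, eps_mis: float, eps_tol: float) -> float:
    """Leading-order term K * max(eps_mis, eps_tol) from property (P1).

    Hypothetical helper for illustration: all other factors of the
    actual regret bound are deliberately left out.
    """
    return num_episodes * max(eps_mis, eps_tol)


if __name__ == "__main__":
    K = 10_000        # number of episodes (assumed value)
    eps_mis = 0.01    # degree of misspecification (assumed value)
    for eps_tol in (0.1, 0.01, 0.001):
        # Shrinking eps_tol below eps_mis no longer reduces the leading term.
        print(eps_tol, regret_scale(K, eps_mis, eps_tol))
```

In this sketch, once the tolerance drops below the misspecification level, the term is governed by $\varepsilon_{\text{mis}}$ alone, which is why a tighter user-specified tolerance cannot help further at this order.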
