LGFeb 19, 2025

Optimistically Optimistic Exploration for Provably Efficient Infinite-Horizon Reinforcement and Imitation Learning

arXiv:2502.13900v217.911 citationsh-index: 28COLT

Originality Highly original

AI Analysis

This work provides a new, provably efficient algorithm for reinforcement learning in a specific type of MDP, which is relevant for researchers working on theoretical guarantees in RL.

This paper addresses reinforcement learning in infinite-horizon discounted linear Markov decision processes, proposing the first computationally efficient algorithm that achieves rate-optimal regret guarantees. The algorithm combines additive exploration bonuses and artificial transitions, resulting in a regret of order \tilde{\mathcal{O}} (\sqrt{d^3 (1 - \gamma)^{- 7 / 2} T}).

We study the problem of reinforcement learning in infinite-horizon discounted linear Markov decision processes (MDPs), and propose the first computationally efficient algorithm achieving rate-optimal regret guarantees in this setting. Our main idea is to combine two classic techniques for optimistic exploration: additive exploration bonuses applied to the reward function, and artificial transitions made to an absorbing state with maximal return. We show that, combined with a regularized approximate dynamic-programming scheme, the resulting algorithm achieves a regret of order $\tilde{\mathcal{O}} (\sqrt{d^3 (1 - γ)^{- 7 / 2} T})$, where $T$ is the total number of sample transitions, $γ\in (0,1)$ is the discount factor, and $d$ is the feature dimensionality. The results continue to hold against adversarial reward sequences, enabling application of our method to the problem of imitation learning in linear MDPs, where we achieve state-of-the-art results.

View on arXiv PDF

Similar