MLLGMay 23, 2024

Reinforcement Learning for Infinite-Horizon Average-Reward Linear MDPs via Approximation by Discounted-Reward MDPs

arXiv:2405.15050v35 citationsh-index: 54AISTATS
Originality Highly original
AI Analysis

This addresses computational inefficiency and restrictive assumptions in average-reward RL for linear MDPs, offering a more practical solution for researchers and practitioners in reinforcement learning.

The paper tackles the problem of infinite-horizon average-reward reinforcement learning in linear MDPs, where the Bellman operator's non-contraction property poses challenges, and achieves a regret bound of Õ(√T) with polynomial computational complexity without strong dynamics assumptions.

We study the problem of infinite-horizon average-reward reinforcement learning with linear Markov decision processes (MDPs). The associated Bellman operator of the problem not being a contraction makes the algorithm design challenging. Previous approaches either suffer from computational inefficiency or require strong assumptions on dynamics, such as ergodicity, for achieving a regret bound of $\widetilde{O}(\sqrt{T})$. In this paper, we propose the first algorithm that achieves $\widetilde{O}(\sqrt{T})$ regret with computational complexity polynomial in the problem parameters, without making strong assumptions on dynamics. Our approach approximates the average-reward setting by a discounted MDP with a carefully chosen discounting factor, and then applies an optimistic value iteration. We propose an algorithmic structure that plans for a nonstationary policy through optimistic value iteration and follows that policy until a specified information metric in the collected data doubles. Additionally, we introduce a value function clipping procedure for limiting the span of the value function for sample efficiency.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes