LGSTMLMay 24, 2023

Regret-Optimal Model-Free Reinforcement Learning for Discounted MDPs with Short Burn-In Time

arXiv:2305.15546v211 citations
Originality Highly original
AI Analysis

This addresses a crucial bottleneck in reinforcement learning for researchers and practitioners by providing a more efficient and practical algorithm for online learning in discounted MDPs, though it is incremental as it builds on existing variance reduction techniques.

The paper tackles the problem of achieving regret optimality in tabular infinite-horizon discounted Markov decision processes with low computational cost and short burn-in time, introducing a model-free algorithm that is the first to be regret-optimal in this setting while reducing burn-in time.

A crucial problem in reinforcement learning is learning the optimal policy. We study this in tabular infinite-horizon discounted Markov decision processes under the online setting. The existing algorithms either fail to achieve regret optimality or have to incur a high memory and computational cost. In addition, existing optimal algorithms all require a long burn-in time in order to achieve optimal sample efficiency, i.e., their optimality is not guaranteed unless sample size surpasses a high threshold. We address both open problems by introducing a model-free algorithm that employs variance reduction and a novel technique that switches the execution policy in a slow-yet-adaptive manner. This is the first regret-optimal model-free algorithm in the discounted setting, with the additional benefit of a low burn-in time.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes