LG MLFeb 12, 2020

Regret Bounds for Discounted MDPs

arXiv:2002.05138v314.723 citations

Originality Incremental advance

AI Analysis

This work addresses a fundamental issue in reinforcement learning for researchers and practitioners by providing a more appropriate performance metric for finite-time scenarios, though it is incremental as it builds on existing regret concepts.

The paper tackles the problem of measuring performance in non-episodic reinforcement learning by proposing a new family of measures called γ-regret to better capture finite-time optimality, and it derives lower and upper bounds for these measures, with a follow-up work later closing the gap to a specific bound.

Reinforcement learning (RL) has traditionally been understood from an episodic perspective; the concept of non-episodic RL, where there is no restart and therefore no reliable recovery, remains elusive. A fundamental question in non-episodic RL is how to measure the performance of a learner and derive algorithms to maximize such performance. Conventional wisdom is to maximize the difference between the average reward received by the learner and the maximal long-term average reward. In this paper, we argue that if the total time budget is relatively limited compared to the complexity of the environment, such comparison may fail to reflect the finite-time optimality of the learner. We propose a family of measures, called $γ$-regret, which we believe to better capture the finite-time optimality. We give motivations and derive lower and upper bounds for such measures. Note: A follow-up work (arXiv:2010.00587) has improved both our lower and upper bound, the gap is now closed at $\tildeΘ\left(\frac{\sqrt{SAT}}{(1 - γ)^{\frac{1}{2}}}\right)$.

View on arXiv PDF

Similar