LG MLFeb 2

Data- and Variance-dependent Regret Bounds for Online Tabular MDPs

Mingyi Li, Taira Tsuchiya, Kenji Yamanishi

arXiv:2602.01903v11.4h-index: 2

Originality Incremental advance

AI Analysis

This work addresses the problem of improving regret bounds for online MDPs, offering incremental algorithmic advancements for reinforcement learning researchers.

The paper tackles online episodic tabular MDPs with known transitions by developing best-of-both-worlds algorithms that achieve refined data-dependent regret bounds in adversarial regimes and variance-dependent bounds in stochastic regimes, resulting in near-optimal polylogarithmic gap-dependent bounds.

This work studies online episodic tabular Markov decision processes (MDPs) with known transitions and develops best-of-both-worlds algorithms that achieve refined data-dependent regret bounds in the adversarial regime and variance-dependent regret bounds in the stochastic regime. We quantify MDP complexity using a first-order quantity and several new data-dependent measures for the adversarial regime, including a second-order quantity and a path-length measure, as well as variance-based measures for the stochastic regime. To adapt to these measures, we develop algorithms based on global optimization and policy optimization, both built on optimistic follow-the-regularized-leader with log-barrier regularization. For global optimization, our algorithms achieve first-order, second-order, and path-length regret bounds in the adversarial regime, and in the stochastic regime, they achieve a variance-aware gap-independent bound and a variance-aware gap-dependent bound that is polylogarithmic in the number of episodes. For policy optimization, our algorithms achieve the same data- and variance-dependent adaptivity, up to a factor of the episode horizon, by exploiting a new optimistic $Q$-function estimator. Finally, we establish regret lower bounds in terms of data-dependent complexity measures for the adversarial regime and a variance measure for the stochastic regime, implying that the regret upper bounds achieved by the global-optimization approach are nearly optimal.

View on arXiv PDF

Similar