LG MLAug 26, 2022

Dynamic Regret of Online Markov Decision Processes

arXiv:2208.12483v112.422 citationsh-index: 21

Originality Highly original

AI Analysis

This work addresses the problem of dynamic regret in online MDPs for researchers in reinforcement learning and online optimization, offering theoretical guarantees and optimality results.

The paper tackles online Markov Decision Processes with adversarial losses by proposing novel ensemble algorithms for three models, achieving minimax optimal dynamic regret bounds for episodic Stochastic Shortest Path problems and demonstrating improved bounds with predictable environments.

We investigate online Markov Decision Processes (MDPs) with adversarially changing loss functions and known transitions. We choose dynamic regret as the performance measure, defined as the performance difference between the learner and any sequence of feasible changing policies. The measure is strictly stronger than the standard static regret that benchmarks the learner's performance with a fixed compared policy. We consider three foundational models of online MDPs, including episodic loop-free Stochastic Shortest Path (SSP), episodic SSP, and infinite-horizon MDPs. For these three models, we propose novel online ensemble algorithms and establish their dynamic regret guarantees respectively, in which the results for episodic (loop-free) SSP are provably minimax optimal in terms of time horizon and certain non-stationarity measure. Furthermore, when the online environments encountered by the learner are predictable, we design improved algorithms and achieve better dynamic regret bounds for the episodic (loop-free) SSP; and moreover, we demonstrate impossibility results for the infinite-horizon MDPs.

View on arXiv PDF

Similar