LGOCOct 3, 2022

Square-root regret bounds for continuous-time episodic Markov decision processes

arXiv:2210.00832v27 citationsh-index: 58
AI Analysis

This work addresses the problem of efficient learning in continuous-time MDPs for researchers and practitioners in reinforcement learning, representing an incremental advance by extending regret analysis to this setting.

The authors tackled reinforcement learning for continuous-time Markov decision processes (MDPs) in the episodic setting by developing an algorithm based on value iteration and upper confidence bounds, achieving worst-case expected regret bounds of order square-root on the number of episodes, with simulation experiments illustrating its performance.

We study reinforcement learning for continuous-time Markov decision processes (MDPs) in the finite-horizon episodic setting. In contrast to discrete-time MDPs, the inter-transition times of a continuous-time MDP are exponentially distributed with rate parameters depending on the state--action pair at each transition. We present a learning algorithm based on the methods of value iteration and upper confidence bound. We derive an upper bound on the worst-case expected regret for the proposed algorithm, and establish a worst-case lower bound, both bounds are of the order of square-root on the number of episodes. Finally, we conduct simulation experiments to illustrate the performance of our algorithm.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes