LGOCMLMay 23, 2022

Logarithmic regret bounds for continuous-time average-reward Markov decision processes

arXiv:2205.11168v49 citationsh-index: 58
Originality Highly original
AI Analysis

This work addresses the problem of efficient learning in continuous-time MDPs for applications like queuing or control systems, representing a strong theoretical advance but incremental relative to discrete-time methods.

The paper tackles reinforcement learning for continuous-time Markov decision processes in the average-reward setting, deriving instance-dependent logarithmic regret lower bounds and designing an algorithm that achieves this rate with finite-time guarantees.

We consider reinforcement learning for continuous-time Markov decision processes (MDPs) in the infinite-horizon, average-reward setting. In contrast to discrete-time MDPs, a continuous-time process moves to a state and stays there for a random holding time after an action is taken. With unknown transition probabilities and rates of exponential holding times, we derive instance-dependent regret lower bounds that are logarithmic in the time horizon. Moreover, we design a learning algorithm and establish a finite-time regret bound that achieves the logarithmic growth rate. Our analysis builds upon upper confidence reinforcement learning, a delicate estimation of the mean holding times, and stochastic comparison of point processes.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes