LG MLJan 22, 2025

The regret lower bound for communicating Markov Decision Processes

arXiv:2501.13013v19.42 citationsh-index: 27

Originality Highly original

AI Analysis

This work addresses a theoretical gap in reinforcement learning for researchers, providing foundational insights into regret bounds in non-ergodic MDPs, though it is incremental as it builds on known results for ergodic MDPs.

The paper tackles the problem of extending regret lower bounds from ergodic to communicating Markov Decision Processes (MDPs), proving that the lower bound becomes significantly more complex and involves intertwined explorative and co-explorative behaviors, with computational hardness results such as Σ₂ᴾ-hardness and coNP-hardness for testing feasibility.

This paper is devoted to the extension of the regret lower bound beyond ergodic Markov decision processes (MDPs) in the problem dependent setting. While the regret lower bound for ergodic MDPs is well-known and reached by tractable algorithms, we prove that the regret lower bound becomes significatively more complex in communicating MDPs. Our lower bound revisits the necessary explorative behavior of consistent learning agents and further explains that all optimal regions of the environment must be overvisited compared to sub-optimal ones, a phenomenon that we refer to as co-exploration. In tandem, we show that these two explorative and co-explorative behaviors are intertwined with navigation constraints obtained by scrutinizing the navigation structure at logarithmic scale. The resulting lower bound is expressed as the solution of an optimization problem that, in many standard classes of MDPs, can be specialized to recover existing results. From a computational perspective, it is provably $Σ_2^\textrm{P}$-hard in general and as a matter of fact, even testing the membership to the feasible region is coNP-hard. We further provide an algorithm to approximate the lower bound in a constructive way.

View on arXiv PDF

Similar