LGFeb 9, 2021

RL for Latent MDPs: Regret Guarantees and a Lower Bound

arXiv:2102.04939v188 citations
Originality Incremental advance
AI Analysis

This work addresses the fundamental problem of efficient learning in partially observable environments for reinforcement learning agents, providing theoretical bounds and conditions for tractability.

This paper investigates regret minimization in Latent Markov Decision Processes (LMDPs), where an agent interacts with one of M possible MDPs without knowing its identity. The authors demonstrate that a general LMDP instance requires at least Ω((SA)^M) episodes to approximate an optimal policy, but that polynomial episodes are sufficient under specific separation assumptions.

In this work, we consider the regret minimization problem for reinforcement learning in latent Markov Decision Processes (LMDP). In an LMDP, an MDP is randomly drawn from a set of $M$ possible MDPs at the beginning of the interaction, but the identity of the chosen MDP is not revealed to the agent. We first show that a general instance of LMDPs requires at least $Ω((SA)^M)$ episodes to even approximate the optimal policy. Then, we consider sufficient assumptions under which learning good policies requires polynomial number of episodes. We show that the key link is a notion of separation between the MDP system dynamics. With sufficient separation, we provide an efficient algorithm with local guarantee, {\it i.e.,} providing a sublinear regret guarantee when we are given a good initialization. Finally, if we are given standard statistical sufficiency assumptions common in the Predictive State Representation (PSR) literature (e.g., Boots et al.) and a reachability assumption, we show that the need for initialization can be removed.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes