When Is Partially Observable Reinforcement Learning Not Scary?
This addresses the problem of sample inefficiency in RL for practitioners dealing with partial observability, though it is incremental as it focuses on a specific subclass rather than general POMDPs.
The paper tackles the challenge of partially observable reinforcement learning (RL) by identifying a subclass called weakly revealing POMDPs, where learning is tractable, and proves that a simple algorithm combining optimism and MLE achieves polynomial sample complexity, specifically for overcomplete POMDPs with more latent states than observations.
Applications of Reinforcement Learning (RL), in which agents learn to make a sequence of decisions despite lacking complete information about the latent states of the controlled system, that is, they act under partial observability of the states, are ubiquitous. Partially observable RL can be notoriously difficult -- well-known information-theoretic results show that learning partially observable Markov decision processes (POMDPs) requires an exponential number of samples in the worst case. Yet, this does not rule out the existence of large subclasses of POMDPs over which learning is tractable. In this paper we identify such a subclass, which we call weakly revealing POMDPs. This family rules out the pathological instances of POMDPs where observations are uninformative to a degree that makes learning hard. We prove that for weakly revealing POMDPs, a simple algorithm combining optimism and Maximum Likelihood Estimation (MLE) is sufficient to guarantee polynomial sample complexity. To the best of our knowledge, this is the first provably sample-efficient result for learning from interactions in overcomplete POMDPs, where the number of latent states can be larger than the number of observations.