LGAIFeb 8, 2022

Provable Reinforcement Learning with a Short-Term Memory

arXiv:2202.03983v146 citations
Originality Highly original
AI Analysis

This work addresses the problem of efficient reinforcement learning in partially observable environments for researchers and practitioners, offering provable guarantees with reduced sample complexity, though it is incremental as it focuses on a specific subclass rather than general POMDPs.

The paper tackles the challenge of partial observability in reinforcement learning by proposing a new subclass of POMDPs where latent states can be decoded from recent history of a short length m, establishing sample complexity bounds for learning near-optimal policies in tabular and rich-observation settings, with algorithms achieving sample complexity scaling exponentially with m rather than the horizon and independent of the number of observations.

Real-world sequential decision making problems commonly involve partial observability, which requires the agent to maintain a memory of history in order to infer the latent states, plan and make good decisions. Coping with partial observability in general is extremely challenging, as a number of worst-case statistical and computational barriers are known in learning Partially Observable Markov Decision Processes (POMDPs). Motivated by the problem structure in several physical applications, as well as a commonly used technique known as "frame stacking", this paper proposes to study a new subclass of POMDPs, whose latent states can be decoded by the most recent history of a short length $m$. We establish a set of upper and lower bounds on the sample complexity for learning near-optimal policies for this class of problems in both tabular and rich-observation settings (where the number of observations is enormous). In particular, in the rich-observation setting, we develop new algorithms using a novel "moment matching" approach with a sample complexity that scales exponentially with the short length $m$ rather than the problem horizon, and is independent of the number of observations. Our results show that a short-term memory suffices for reinforcement learning in these environments.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes