LGOCJul 8, 2021

Sublinear Regret for Learning POMDPs

arXiv:2107.03635v427 citations
Originality Highly original
AI Analysis

This addresses the challenge of efficient learning in POMDPs, a key problem in AI for sequential decision-making under uncertainty, with a novel theoretical guarantee.

The paper tackles the problem of model-based reinforcement learning for partially observable Markov decision processes (POMDPs) by proposing a learning algorithm that achieves a regret bound of O(T^{2/3}√log T), which is the first sublinear regret result for general POMDPs.

We study the model-based undiscounted reinforcement learning for partially observable Markov decision processes (POMDPs). The oracle we consider is the optimal policy of the POMDP with a known environment in terms of the average reward over an infinite horizon. We propose a learning algorithm for this problem, building on spectral method-of-moments estimations for hidden Markov models, the belief error control in POMDPs and upper-confidence-bound methods for online learning. We establish a regret bound of $O(T^{2/3}\sqrt{\log T})$ for the proposed learning algorithm where $T$ is the learning horizon. This is, to the best of our knowledge, the first algorithm achieving sublinear regret with respect to our oracle for learning general POMDPs.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes