Sample-Efficient Reinforcement Learning of Partially Observable Markov Games
This work addresses the problem of multi-agent reinforcement learning under partial observability for researchers and practitioners, providing the first sample-efficient results for learning POMGs, though it is incremental as it focuses on a specific subclass.
The paper tackles the challenge of sample-efficient learning in multiplayer general-sum Partially Observable Markov Games (POMGs) under partial observability, identifying a subclass called weakly revealing POMGs where a simple optimistic MLE algorithm finds approximate equilibria in a polynomial number of samples for small agent numbers and achieves sublinear regret against adversarial opponents.
This paper considers the challenging tasks of Multi-Agent Reinforcement Learning (MARL) under partial observability, where each agent only sees her own individual observations and actions that reveal incomplete information about the underlying state of system. This paper studies these tasks under the general model of multiplayer general-sum Partially Observable Markov Games (POMGs), which is significantly larger than the standard model of Imperfect Information Extensive-Form Games (IIEFGs). We identify a rich subclass of POMGs -- weakly revealing POMGs -- in which sample-efficient learning is tractable. In the self-play setting, we prove that a simple algorithm combining optimism and Maximum Likelihood Estimation (MLE) is sufficient to find approximate Nash equilibria, correlated equilibria, as well as coarse correlated equilibria of weakly revealing POMGs, in a polynomial number of samples when the number of agents is small. In the setting of playing against adversarial opponents, we show that a variant of our optimistic MLE algorithm is capable of achieving sublinear regret when being compared against the optimal maximin policies. To our best knowledge, this work provides the first line of sample-efficient results for learning POMGs.