LGAIMLJun 22, 2020

Near-Optimal Reinforcement Learning with Self-Play

arXiv:2006.12007v2147 citations
AI Analysis

It provides near-optimal algorithms for learning in Markov games, which is important for AI in competitive environments like game theory and robotics, though it is incremental in improving sample complexity bounds.

This paper tackles the problem of designing optimal reinforcement learning algorithms for two-player zero-sum games using self-play, achieving a sample complexity of ω(S(A+B)) that nearly matches the lower bound, closing a significant gap from previous methods.

This paper considers the problem of designing optimal algorithms for reinforcement learning in two-player zero-sum games. We focus on self-play algorithms which learn the optimal policy by playing against itself without any direct supervision. In a tabular episodic Markov game with $S$ states, $A$ max-player actions and $B$ min-player actions, the best existing algorithm for finding an approximate Nash equilibrium requires $\tilde{\mathcal{O}}(S^2AB)$ steps of game playing, when only highlighting the dependency on $(S,A,B)$. In contrast, the best existing lower bound scales as $Ω(S(A+B))$ and has a significant gap from the upper bound. This paper closes this gap for the first time: we propose an optimistic variant of the \emph{Nash Q-learning} algorithm with sample complexity $\tilde{\mathcal{O}}(SAB)$, and a new \emph{Nash V-learning} algorithm with sample complexity $\tilde{\mathcal{O}}(S(A+B))$. The latter result matches the information-theoretic lower bound in all problem-dependent parameters except for a polynomial factor of the length of each episode. In addition, we present a computational hardness result for learning the best responses against a fixed opponent in Markov games---a learning objective different from finding the Nash equilibrium.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes