LG, GT, IT, ST, ML | Jun 8, 2022

Model-Based Reinforcement Learning for Offline Zero-Sum Markov Games

Princeton
arXiv:2206.04044v2 | 19 citations | h-index: 110
AI Analysis

This paper addresses sample-efficient offline reinforcement learning in competitive multi-agent settings, a problem of interest to both researchers and practitioners, and represents a significant advance rather than an incremental improvement.

The paper tackles learning Nash equilibria in two-player zero-sum Markov games from offline data. It proposes a pessimistic model-based algorithm, VI-LCB-Game, whose sample complexity of $\frac{C_{\mathsf{clipped}}^{\star}S(A+B)}{(1-γ)^{3}\varepsilon^{2}}$ (up to a log factor) is minimax optimal and improves on prior work by a factor of $\min\{A,B\}$.

This paper makes progress towards learning Nash equilibria in two-player zero-sum Markov games from offline data. Specifically, consider a $γ$-discounted infinite-horizon Markov game with $S$ states, where the max-player has $A$ actions and the min-player has $B$ actions. We propose a pessimistic model-based algorithm with Bernstein-style lower confidence bounds -- called VI-LCB-Game -- that provably finds an $\varepsilon$-approximate Nash equilibrium with a sample complexity no larger than $\frac{C_{\mathsf{clipped}}^{\star}S(A+B)}{(1-γ)^{3}\varepsilon^{2}}$ (up to some log factor). Here, $C_{\mathsf{clipped}}^{\star}$ is some unilateral clipped concentrability coefficient that reflects the coverage and distribution shift of the available data (vis-à-vis the target data), and the target accuracy $\varepsilon$ can be any value within $\big(0,\frac{1}{1-γ}\big]$. Our sample complexity bound strengthens prior art by a factor of $\min\{A,B\}$, achieving minimax optimality for the entire $\varepsilon$-range. An appealing feature of our result lies in algorithmic simplicity, which reveals the unnecessity of variance reduction and sample splitting in achieving sample optimality.
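
To get a feel for how the stated bound scales, the following minimal sketch evaluates only its leading term, $\frac{C_{\mathsf{clipped}}^{\star}S(A+B)}{(1-γ)^{3}\varepsilon^{2}}$. The function name `leading_order_samples` and its argument names are hypothetical, and the universal constant and the log factor mentioned in the abstract are deliberately omitted, so the output is a scaling indicator rather than an actual sample count.

```python
def leading_order_samples(S, A, B, gamma, eps, c_clipped):
    """Leading term C* S (A + B) / ((1 - gamma)^3 eps^2) of the stated bound.

    Assumption: constants and log factors are dropped, so the result only
    shows how the bound scales with each parameter.
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    # The abstract states the bound for eps in (0, 1 / (1 - gamma)].
    if not 0.0 < eps <= 1.0 / (1.0 - gamma):
        raise ValueError("eps must lie in (0, 1/(1-gamma)]")
    return c_clipped * S * (A + B) / ((1.0 - gamma) ** 3 * eps ** 2)


if __name__ == "__main__":
    base = leading_order_samples(S=10, A=4, B=6, gamma=0.5, eps=1.0, c_clipped=2.0)
    finer = leading_order_samples(S=10, A=4, B=6, gamma=0.5, eps=0.5, c_clipped=2.0)
    print(base, finer / base)
```

Halving $\varepsilon$ quadruples the leading term, while the action-space dependence is additive in $A+B$ rather than multiplicative, which is where the $\min\{A,B\}$ improvement over prior bounds comes from.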
