Model-Based Reinforcement Learning for Offline Zero-Sum Markov Games
This paper addresses sample-efficient offline reinforcement learning in competitive multi-agent settings, a problem of interest to both researchers and practitioners. Its main result closes a $\min\{A,B\}$ gap in sample complexity relative to prior work, rather than tightening constants.
The paper tackles learning Nash equilibria in two-player zero-sum Markov games from offline data. It proposes a pessimistic model-based algorithm, VI-LCB-Game, whose sample complexity is at most $\frac{C_{\mathsf{clipped}}^{\star}S(A+B)}{(1-γ)^{3}\varepsilon^{2}}$ up to a logarithmic factor. This bound is minimax optimal and improves on prior work by a factor of $\min\{A,B\}$.
This paper makes progress towards learning Nash equilibria in two-player zero-sum Markov games from offline data. Specifically, consider a $γ$-discounted infinite-horizon Markov game with $S$ states, where the max-player has $A$ actions and the min-player has $B$ actions. We propose a pessimistic model-based algorithm with Bernstein-style lower confidence bounds (called VI-LCB-Game) that provably finds an $\varepsilon$-approximate Nash equilibrium with a sample complexity no larger than $\frac{C_{\mathsf{clipped}}^{\star}S(A+B)}{(1-γ)^{3}\varepsilon^{2}}$ (up to a logarithmic factor). Here, $C_{\mathsf{clipped}}^{\star}$ is a unilateral clipped concentrability coefficient that reflects the coverage and distribution shift of the available data (vis-à-vis the target data), and the target accuracy $\varepsilon$ can be any value within $\big(0,\frac{1}{1-γ}\big]$. Our sample complexity bound strengthens prior art by a factor of $\min\{A,B\}$, achieving minimax optimality for the entire $\varepsilon$-range. An appealing feature of our result is its algorithmic simplicity: it shows that neither variance reduction nor sample splitting is needed to achieve sample optimality.
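To give a feel for how the stated bound scales, the following minimal sketch evaluates its leading-order term, $C_{\mathsf{clipped}}^{\star}S(A+B)/((1-γ)^{3}\varepsilon^{2})$, for given problem sizes. It is not an implementation of VI-LCB-Game. The function names `leading_order_samples` and `prior_art_scale` are hypothetical, the logarithmic factor and any absolute constants are dropped, and the "prior art" figure is simply the stated bound multiplied by $\min\{A,B\}$, as the abstract describes the gap.

```python
def leading_order_samples(c_clipped, S, A, B, gamma, eps):
    """Leading-order term C* S (A + B) / ((1 - gamma)^3 eps^2).

    Illustrative only: log factors and absolute constants are omitted.
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in (0, 1)")
    # The abstract allows any target accuracy in (0, 1 / (1 - gamma)].
    if not 0.0 < eps <= 1.0 / (1.0 - gamma):
        raise ValueError("eps must lie in (0, 1/(1-gamma)]")
    return c_clipped * S * (A + B) / ((1.0 - gamma) ** 3 * eps ** 2)


def prior_art_scale(c_clipped, S, A, B, gamma, eps):
    """Stated bound inflated by min(A, B), the gap reported in the abstract."""
    return min(A, B) * leading_order_samples(c_clipped, S, A, B, gamma, eps)


if __name__ == "__main__":
    # Hypothetical problem sizes chosen only for illustration.
    new = leading_order_samples(1.0, S=10, A=4, B=6, gamma=0.5, eps=1.0)
    old = prior_art_scale(1.0, S=10, A=4, B=6, gamma=0.5, eps=1.0)
    print(f"stated bound ~ {new:.0f} samples, prior-art scale ~ {old:.0f} samples")
```

The dependence on the action counts is additive, $A+B$, so the saving relative to the prior-art scale grows with the size of the smaller action set.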