Last-Iterate Convergence of Payoff-Based Independent Learning in Zero-Sum Stochastic Games
This work addresses the challenge of convergence in multi-agent reinforcement learning for game theory, providing theoretical guarantees for practical algorithms, though it is incremental in building upon smoothed best-response dynamics and minimax value iteration.
The paper tackles the problem of learning Nash equilibria in two-player zero-sum matrix and stochastic games using payoff-based independent learning dynamics, achieving finite-sample analysis with last-iterate guarantees, including sample complexities of O(ε^{-1}) for Nash distributions in matrix games and O(ε^{-8}) for Nash equilibria in both settings.
In this paper, we consider two-player zero-sum matrix and stochastic games and develop learning dynamics that are payoff-based, convergent, rational, and symmetric between the two players. Specifically, the learning dynamics for matrix games are based on the smoothed best-response dynamics, while the learning dynamics for stochastic games build upon those for matrix games, with additional incorporation of the minimax value iteration. To our knowledge, our theoretical results present the first finite-sample analysis of such learning dynamics with last-iterate guarantees. In the matrix game setting, the results imply a sample complexity of $O(ε^{-1})$ to find the Nash distribution and a sample complexity of $O(ε^{-8})$ to find a Nash equilibrium. In the stochastic game setting, the results also imply a sample complexity of $O(ε^{-8})$ to find a Nash equilibrium. To establish these results, the main challenge is to handle stochastic approximation algorithms with multiple sets of coupled and stochastic iterates that evolve on (possibly) different time scales. To overcome this challenge, we developed a coupled Lyapunov-based approach, which may be of independent interest to the broader community studying the convergence behavior of stochastic approximation algorithms.