A Finite-Sample Analysis of Payoff-Based Independent Learning in Zero-Sum Stochastic Games
This addresses the challenge of efficient learning in multi-agent stochastic games, providing theoretical guarantees for independent learning dynamics, though it is incremental as it builds on existing methods like TD-learning and best-response dynamics.
The paper tackles the problem of learning in two-player zero-sum stochastic games by proposing Doubly Smoothed Best-Response dynamics, which integrates smoothed best-response into TD-learning and minimax value iteration, achieving finite-sample guarantees with an O~(1/ε^2) sample complexity bound for payoff-based independent learning dynamics, and a sharper O~(1/ε) bound for matrix games.
We study two-player zero-sum stochastic games, and propose a form of independent learning dynamics called Doubly Smoothed Best-Response dynamics, which integrates a discrete and doubly smoothed variant of the best-response dynamics into temporal-difference (TD)-learning and minimax value iteration. The resulting dynamics are payoff-based, convergent, rational, and symmetric among players. Our main results provide finite-sample guarantees. In particular, we prove the first-known $\tilde{\mathcal{O}}(1/ε^2)$ sample complexity bound for payoff-based independent learning dynamics, up to a smoothing bias. In the special case where the stochastic game has only one state (i.e., matrix games), we provide a sharper $\tilde{\mathcal{O}}(1/ε)$ sample complexity. Our analysis uses a novel coupled Lyapunov drift approach to capture the evolution of multiple sets of coupled and stochastic iterates, which might be of independent interest.