GT AI LG Oct 25, 2022

Learning in Multi-Player Stochastic Games

arXiv:2210.14280v12.31 citations h-index: 5

Originality: Incremental advance

AI Analysis

The paper addresses computational challenges in multi-agent reinforcement learning, offering researchers and practitioners incremental algorithmic improvements for specific classes of games.

The paper sidesteps the intractability of Nash equilibria in multi-player stochastic games by targeting correlated equilibria. It shows that, in the adversarial MDP problem, obtaining sublinear regret against the best non-stationary policy is NP-hard even for a horizon of 3. It then gives an algorithm that generates an extensive-form correlated equilibrium in time exponential in the horizon but polynomial in all other parameters, along with a variant that is polynomial in all parameters for fast-mixing games.

We consider the problem of simultaneous learning in stochastic games with many players in the finite-horizon setting. While the typical target solution for a stochastic game is a Nash equilibrium, this is intractable with many players. We instead focus on variants of {\it correlated equilibria}, such as those studied for extensive-form games. We begin with a hardness result for the adversarial MDP problem: even for a horizon of 3, obtaining sublinear regret against the best non-stationary policy is \textsf{NP}-hard when both rewards and transitions are adversarial. This implies that convergence to even the weakest natural solution concept -- normal-form coarse correlated equilibrium -- is not possible via black-box reduction to a no-regret algorithm even in stochastic games with constant horizon (unless $\textsf{NP}\subseteq\textsf{BPP}$). Instead, we turn to a different target: algorithms which {\it generate} an equilibrium when they are used by all players. Our main result is an algorithm which generates an {\it extensive-form} correlated equilibrium, whose runtime is exponential in the horizon but polynomial in all other parameters. We give a similar algorithm which is polynomial in all parameters for "fast-mixing" stochastic games. We also show a method for efficiently reaching normal-form coarse correlated equilibria in "single-controller" stochastic games which follows the traditional no-regret approach. When shared randomness is available, the two generative algorithms can be extended to give simultaneous regret bounds and converge in the traditional sense.
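For readers unfamiliar with the "traditional no-regret approach" the abstract refers to, the following minimal sketch shows it in the simplest setting: a one-shot two-player normal-form game, not a stochastic game. It is not the paper's algorithm. It uses the standard Hedge (multiplicative weights) update, and the names `hedge_joint_play` and `cce_gap`, the payoff layout, and the example game are all assumptions made for illustration. When both players run Hedge, the time-averaged joint play is an approximate coarse correlated equilibrium, with a gap no larger than each player's average regret.

```python
import math


def hedge_joint_play(payoffs, rounds=2000):
    """Both players run Hedge; returns the time-averaged joint distribution.

    payoffs[p][a][b] is player p's reward in [0, 1] when player 0 plays a
    and player 1 plays b. (Hypothetical layout chosen for this sketch.)
    """
    n0, n1 = len(payoffs[0]), len(payoffs[0][0])
    # Standard Hedge step size for rewards in [0, 1].
    eta = math.sqrt(8.0 * math.log(max(n0, n1)) / rounds)
    log_w = [[0.0] * n0, [0.0] * n1]  # log-weights, for numerical stability
    joint = [[0.0] * n1 for _ in range(n0)]

    for _ in range(rounds):
        mixes = []
        for p in range(2):
            top = max(log_w[p])
            w = [math.exp(v - top) for v in log_w[p]]
            total = sum(w)
            mixes.append([v / total for v in w])
        x, y = mixes

        # Accumulate the product distribution played this round.
        for a in range(n0):
            for b in range(n1):
                joint[a][b] += x[a] * y[b] / rounds

        # Full-information feedback: expected reward of each pure action
        # against the opponent's current mixed strategy.
        for a in range(n0):
            log_w[0][a] += eta * sum(payoffs[0][a][b] * y[b] for b in range(n1))
        for b in range(n1):
            log_w[1][b] += eta * sum(payoffs[1][a][b] * x[a] for a in range(n0))
    return joint


def cce_gap(payoffs, joint):
    """Largest gain any player gets from a fixed unilateral deviation."""
    n0, n1 = len(joint), len(joint[0])
    marg0 = [sum(joint[a][b] for b in range(n1)) for a in range(n0)]
    marg1 = [sum(joint[a][b] for a in range(n0)) for b in range(n1)]
    value = [
        sum(joint[a][b] * payoffs[p][a][b] for a in range(n0) for b in range(n1))
        for p in range(2)
    ]
    best0 = max(sum(marg1[b] * payoffs[0][a][b] for b in range(n1)) for a in range(n0))
    best1 = max(sum(marg0[a] * payoffs[1][a][b] for a in range(n0)) for b in range(n1))
    return max(best0 - value[0], best1 - value[1])


if __name__ == "__main__":
    # Illustrative 2x2 game with rewards in [0, 1].
    game = [[[1.0, 0.0], [0.0, 0.5]], [[0.0, 1.0], [1.0, 0.0]]]
    print(cce_gap(game, hedge_joint_play(game)))
```

The hardness result in the abstract says that this black-box route does not carry over to general stochastic games, where each player faces an adversarial MDP; the abstract reports that it remains viable for "single-controller" games.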
