Model-Free Learning for Two-Player Zero-Sum Partially Observable Markov Games with Perfect Recall
This work provides a model-free learning algorithm for finding Nash equilibria in a specific class of imperfect information games, which is relevant for researchers working on game theory and multi-agent reinforcement learning.
This paper addresses the problem of learning a Nash equilibrium in two-player, zero-sum, episodic, tabular imperfect information games with perfect recall and bandit feedback. The proposed Implicit Exploration Online Mirror Descent (IXOMD) algorithm achieves a high-probability convergence rate to the Nash equilibrium of order 1/sqrt(T), where T is the number of played games.
We study the problem of learning a Nash equilibrium (NE) in an imperfect information game (IIG) through self-play. Precisely, we focus on two-player, zero-sum, episodic, tabular IIG under the perfect-recall assumption where the only feedback is realizations of the game (bandit feedback). In particular, the dynamic of the IIG is not known -- we can only access it by sampling or interacting with a game simulator. For this learning setting, we provide the Implicit Exploration Online Mirror Descent (IXOMD) algorithm. It is a model-free algorithm with a high-probability bound on the convergence rate to the NE of order $1/\sqrt{T}$ where $T$ is the number of played games. Moreover, IXOMD is computationally efficient as it needs to perform the updates only along the sampled trajectory.