LGGTMar 11, 2022

Generalized Bandit Regret Minimizer Framework in Imperfect Information Extensive-Form Game

arXiv:2203.05920v41 citationsh-index: 4
Originality Highly original
AI Analysis

This work addresses a core challenge in game theory and reinforcement learning for AI agents, offering a modular and computationally efficient solution with significant theoretical improvements.

The paper tackles the problem of learning approximate Nash equilibrium in two-player zero-sum imperfect information extensive-form games under bandit feedback, proposing a generalized framework and a novel method called SIX-OMD that improves the convergence rate from O(√(XB/T) + √(YC/T)) to O(√(M_X/T) + √(M_Y/T)).

Regret minimization methods are a powerful tool for learning approximate Nash equilibrium (NE) in two-player zero-sum imperfect information extensive-form games (IIEGs). We consider the problem in the interactive bandit-feedback setting where we don't know the dynamics of the IIEG. In general, only the interactive trajectory and the reached terminal node value $v(z^t)$ are revealed. To learn NE, the regret minimizer is required to estimate the full-feedback loss gradient $\ell^t$ by $v(z^t)$ and minimize the regret. In this paper, we propose a generalized framework for this learning setting. It presents a theoretical framework for the design and the modular analysis of the bandit regret minimization methods. We demonstrate that the most recent bandit regret minimization methods can be analyzed as a particular case of our framework. Following this framework, we describe a novel method SIX-OMD to learn approximate NE. It is model-free and extremely improves the best existing convergence rate from the order of $O(\sqrt{X B/T}+\sqrt{Y C/T})$ to $O(\sqrt{ M_{\mathcal{X}}/T} +\sqrt{ M_{\mathcal{Y}}/T})$. Moreover, SIX-OMD is computationally efficient as it needs to perform the current strategy and average strategy updates only along the sampled trajectory.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes