GT LGApr 8

Beyond Pessimism: Offline Learning in KL-regularized Games

arXiv:2604.0673864.45 citationsh-index: 2

Predicted impact top 3% in GT · last 90 daysOriginality Highly original

AI Analysis

This addresses the problem of distribution shift in offline reinforcement learning for game theory, offering a faster statistical rate that is incremental but with specific gains.

The paper tackles offline learning in KL-regularized two-player zero-sum games by developing a pessimism-free algorithm and analytical framework, achieving a sample complexity bound of $\widetilde{\mathcal{O}}(1/n)$, which improves upon prior work's $\widetilde{\mathcal{O}}(1/\sqrt{n})$ rates.

We study offline learning in KL-regularized two-player zero-sum games, where policies are optimized under a KL constraint to a fixed reference policy. Prior work relies on pessimistic value estimation to handle distribution shift, yielding only $\widetilde{\mathcal{O}}(1/\sqrt n)$ statistical rates. We develop a new pessimism-free algorithm and analytical framework for KL-regularized games, built on the smoothness of KL-regularized best responses and a stability property of the Nash equilibrium induced by skew symmetry. This yields the first $\widetilde{\mathcal{O}}(1/n)$ sample complexity bound for offline learning in KL-regularized zero-sum games, achieved entirely without pessimism. We further propose an efficient self-play policy optimization algorithm and prove that, with a number of iterations linear in the sample size, it achieves the same fast $\widetilde{\mathcal{O}}(1/n)$ statistical rate as the minimax estimator.

View on arXiv PDF

Similar