Pessimism-Free Offline Learning in General-Sum Games via KL Regularization
For multi-agent RL practitioners, the paper offers a simpler, pessimism-free approach to offline learning in general-sum games, backed by theoretical guarantees.
The paper shows that KL regularization alone can stabilize offline multi-agent reinforcement learning in general-sum games, achieving equilibrium recovery without manually designed pessimistic penalties. The proposed GANE method recovers regularized Nash equilibria at an accelerated rate of $\widetilde{O}(1/n)$, while GAMD converges to a Coarse Correlated Equilibrium at $\widetilde{O}(1/\sqrt{n}+1/T)$, where $n$ is the size of the logged dataset and $T$ is the number of iterations.
Offline multi-agent reinforcement learning in general-sum settings is challenged by the distribution shift between logged datasets and target equilibrium policies. While standard methods rely on manual pessimistic penalties, we demonstrate that KL regularization suffices to stabilize learning and achieve equilibrium recovery. We propose General-sum Anchored Nash Equilibrium (GANE), which recovers regularized Nash equilibria at an accelerated statistical rate of $\widetilde{O}(1/n)$. For computational tractability, we develop General-sum Anchored Mirror Descent (GAMD), an iterative algorithm converging to a Coarse Correlated Equilibrium at the standard rate of $\widetilde{O}(1/\sqrt{n}+1/T)$. These results establish KL regularization as a standalone mechanism for pessimism-free offline learning that achieves equivalent or accelerated rates in multi-player general-sum games.
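To make the idea of a KL anchor concrete, the following is a minimal sketch of a generic KL-anchored mirror descent update on a two-player general-sum matrix game. It is not the paper's GAMD, whose exact update the abstract does not state. The assumptions are: payoff matrices are taken as already estimated from logged data, the anchor is the behavior (reference) policy, and every name here (`anchored_md_step`, `kl_best_response`, `beta`, `eta`) is hypothetical. The update $\pi_{t+1} \propto \pi_t^{1-\eta\beta}\,\mu^{\eta\beta}\exp(\eta q_t)$ has the KL-regularized best response $\pi \propto \mu \exp(q/\beta)$ as its fixed point.

```python
import math

def normalize(weights):
    """Rescale positive weights so they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]

def expected_payoffs(payoff, opponent_policy):
    """payoff[a][b] is the (estimated) payoff of own action a against opponent action b."""
    return [sum(p * q for p, q in zip(row, opponent_policy)) for row in payoff]

def kl_best_response(q_values, anchor, beta):
    """Maximizer of <pi, q> - beta * KL(pi || anchor): pi proportional to anchor * exp(q / beta)."""
    m = max(q_values)  # subtract the max for numerical stability
    return normalize([mu * math.exp((q - m) / beta) for mu, q in zip(anchor, q_values)])

def anchored_md_step(policy, q_values, anchor, beta, eta):
    """One mirror descent step with a KL pull toward the anchor (assumes eta * beta < 1)."""
    m = max(q_values)
    return normalize([
        p ** (1 - eta * beta) * mu ** (eta * beta) * math.exp(eta * (q - m))
        for p, mu, q in zip(policy, anchor, q_values)
    ])

def run_anchored_md(payoff_1, payoff_2, anchor_1, anchor_2, beta=2.0, eta=0.1, steps=500):
    """Both players update simultaneously, each starting from its own anchor policy."""
    pi_1, pi_2 = list(anchor_1), list(anchor_2)
    for _ in range(steps):
        q_1 = expected_payoffs(payoff_1, pi_2)
        q_2 = expected_payoffs(payoff_2, pi_1)
        pi_1, pi_2 = (anchored_md_step(pi_1, q_1, anchor_1, beta, eta),
                      anchored_md_step(pi_2, q_2, anchor_2, beta, eta))
    return pi_1, pi_2

# Illustrative (made-up) payoffs in [0, 1]; each matrix is indexed by the owner's action first.
PAYOFF_1 = [[0.6, 0.0], [1.0, 0.2]]
PAYOFF_2 = [[0.6, 0.0], [1.0, 0.2]]
ANCHOR_1 = [0.7, 0.3]
ANCHOR_2 = [0.5, 0.5]

if __name__ == "__main__":
    pi_1, pi_2 = run_anchored_md(PAYOFF_1, PAYOFF_2, ANCHOR_1, ANCHOR_2)
    print("player 1:", pi_1)
    print("player 2:", pi_2)
```

In this small example the regularization strength is large relative to the payoff scale, so the iterates settle at a point where each policy is the KL-regularized best response to the other, that is, a regularized Nash equilibrium of the estimated game. The anchor keeps the learned policies close to the data-collecting policy, which is the intuition behind using KL regularization in place of an explicit pessimism penalty.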