MA AI GT LG · Feb 29, 2024

Offline Fictitious Self-Play for Competitive Games

Jingxiao Chen, Weiji Xie, Weinan Zhang, Yong Yu, Ying Wen

arXiv:2403.00841v22.31 citations · h-index: 16

Originality: Incremental advance

AI Analysis

This work addresses offline RL for competitive games, enabling policy improvement without online interactions. The advance is incremental but practical for real-world applications that lack simulators.

The paper tackles the challenge of offline multi-agent reinforcement learning in competitive games by introducing OFF-FSP, a model-free algorithm that simulates opponent interactions and approximates Nash equilibrium, achieving significantly lower exploitability than state-of-the-art baselines in experiments on matrix games, poker, and board games.
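Exploitability, the metric cited above, can be made concrete for the simplest case. The following is a minimal sketch for a two-player zero-sum matrix game, not the paper's evaluation code. The function name `exploitability`, the rock-paper-scissors payoff matrix, and the convention of averaging the two players' best-response gains are assumptions made for illustration.

```python
import numpy as np

def exploitability(payoff, x, y):
    """Average best-response gain against the strategy profile (x, y).

    payoff[i, j] is the row player's reward; the column player receives
    the negative (zero-sum). x and y are mixed strategies.
    Assumed convention: the mean of both players' best-response values,
    which is zero exactly at a Nash equilibrium of a zero-sum game.
    """
    row_br_value = np.max(payoff @ y)      # best row payoff against y
    col_br_value = np.max(-(x @ payoff))   # best column payoff against x
    return 0.5 * (row_br_value + col_br_value)

# Rock-paper-scissors payoffs for the row player (rows and columns: R, P, S).
rps = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)
uniform = np.full(3, 1 / 3)
always_rock = np.array([1.0, 0.0, 0.0])

exp_uniform = exploitability(rps, uniform, uniform)    # equilibrium profile
exp_rock = exploitability(rps, always_rock, uniform)   # row player is exploitable
```

In this toy game the uniform profile has zero exploitability, while a row player who always plays rock can be beaten by paper, giving a value of 0.5 under the averaging convention assumed here.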

Offline Reinforcement Learning (RL) enables policy improvement from fixed datasets without online interactions, making it highly suitable for real-world applications lacking efficient simulators. Despite its success in the single-agent setting, offline multi-agent RL remains a challenge, especially in competitive games. First, without knowledge of the game structure, a learner cannot interact with opponents and therefore cannot run self-play, a major learning paradigm for competitive games. Second, real-world datasets cannot cover the full state and action spaces of the game, which makes it hard to identify a Nash equilibrium (NE). To address these issues, this paper introduces OFF-FSP, the first practical model-free offline RL algorithm for competitive games. We start by simulating interactions with various opponents by adjusting the weights of the fixed dataset with importance sampling. This technique allows us to learn the best responses to different opponents and employ the Offline Self-Play learning framework. To overcome the challenge of partial coverage, we combine single-agent offline RL methods with Fictitious Self-Play (FSP) to approximate NE by constraining the approximate best responses away from out-of-distribution actions. Experiments on matrix games, extensive-form poker, and board games demonstrate that OFF-FSP achieves significantly lower exploitability than state-of-the-art baselines. Finally, we validate OFF-FSP on a real-world human-robot competitive task, demonstrating its potential for solving complex, hard-to-simulate real-world problems.
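The dataset re-weighting idea in the abstract can be illustrated on a matrix game. The following is a minimal sketch under stated assumptions, not the OFF-FSP implementation: the function `reweighted_action_values`, the toy rock-paper-scissors dataset, and the assumption that the opponent's data-collecting policy `behavior_opp` is known are all hypothetical. It uses self-normalized importance weights to estimate each action's value against a different opponent, and it excludes actions that never appear in the data, as a crude stand-in for avoiding out-of-distribution actions.

```python
import numpy as np

def reweighted_action_values(a1, a2, r, behavior_opp, target_opp, n_actions):
    """Estimate player 1's expected reward per action against target_opp,
    using only samples collected against behavior_opp.

    a1, a2: integer arrays of the two players' logged actions.
    r: player 1's logged rewards.
    behavior_opp: opponent policy that generated the data (assumed known).
    target_opp: opponent policy we want to simulate.
    """
    w = target_opp[a2] / behavior_opp[a2]  # importance weight per sample
    num = np.bincount(a1, weights=w * r, minlength=n_actions)
    den = np.bincount(a1, weights=w, minlength=n_actions)
    q = np.full(n_actions, -np.inf)        # uncovered actions are never chosen
    covered = den > 0
    q[covered] = num[covered] / den[covered]
    return q

# Toy dataset: every rock-paper-scissors action pair once (rows/cols: R, P, S),
# which matches an opponent that played uniformly at random.
payoff = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)
a1, a2 = (g.ravel() for g in np.meshgrid(np.arange(3), np.arange(3), indexing="ij"))
r = payoff[a1, a2]
behavior_opp = np.full(3, 1 / 3)
target_opp = np.array([1.0, 0.0, 0.0])     # simulate an opponent who always plays rock

q = reweighted_action_values(a1, a2, r, behavior_opp, target_opp, 3)
best_response = int(np.argmax(q))          # index 1, i.e. paper
```

Here the re-weighted values are 0, 1, and -1 for rock, paper, and scissors, so the approximate best response to the simulated rock-only opponent is paper, even though the data were collected against a uniform opponent. Repeating this step against an average of past best responses would give a fictitious-play style loop, which this sketch leaves out.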
