LG SY Nov 17, 2025

DiffFP: Learning Behaviors from Scratch via Diffusion-based Fictitious Play

arXiv:2511.13186v1 4.1 h-index: 1

Originality: Incremental advance

AI Analysis

This work addresses slow or failed convergence in self-play reinforcement learning for continuous decision spaces. The advance is incremental, since it builds on existing fictitious play and diffusion methods.

The paper tackles the challenge of learning robust and adaptive behaviors in continuous-space multi-agent games. It proposes DiffFP, a diffusion-based fictitious play framework that empirically converges towards ε-Nash equilibria in zero-sum games, reporting up to 3× faster convergence and, on average, 30× higher success rates than baseline reinforcement learning methods.

Self-play reinforcement learning has demonstrated significant success in learning complex strategic and interactive behaviors in competitive multi-agent games. However, achieving such behaviors in continuous decision spaces remains challenging. Ensuring adaptability and generalization in self-play settings is critical for achieving competitive performance in dynamic multi-agent environments. These challenges often cause methods to converge slowly or fail to converge at all to a Nash equilibrium, making agents vulnerable to strategic exploitation by unseen opponents. To address these challenges, we propose DiffFP, a fictitious play (FP) framework that estimates the best response to unseen opponents while learning a robust and multimodal behavioral policy. Specifically, we approximate the best response using a diffusion policy that leverages generative modeling to learn adaptive and diverse strategies. Through empirical evaluation, we demonstrate that the proposed FP framework converges towards $\epsilon$-Nash equilibria in continuous-space zero-sum games. We validate our method on complex multi-agent environments, including racing and multi-particle zero-sum games. Simulation results show that the learned policies are robust against diverse opponents and outperform baseline reinforcement learning policies. Our approach achieves up to 3$\times$ faster convergence and 30$\times$ higher success rates on average against RL-based baselines, demonstrating its robustness to opponent strategies and stability across training iterations.
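For readers unfamiliar with fictitious play, the following minimal sketch may help illustrate the idea the abstract builds on. It is classic tabular fictitious play on a small zero-sum matrix game, not the DiffFP method: here each player computes an exact best response to the opponent's empirical average strategy, whereas the abstract describes approximating that best response with a diffusion policy in continuous spaces. The function names, the rock-paper-scissors payoff matrix, and the iteration count are illustrative assumptions. The exploitability value is the sum of both players' best-response gains, so a profile with exploitability at most $\epsilon$ is an $\epsilon$-Nash equilibrium.

```python
import numpy as np


def fictitious_play(payoff, iters=20000):
    """Classic discrete fictitious play on a zero-sum matrix game.

    payoff[i, j] is the row player's payoff; the column player gets -payoff[i, j].
    Returns the empirical (time-averaged) strategies of both players.
    Illustrative only: this is NOT the DiffFP algorithm.
    """
    n_rows, n_cols = payoff.shape
    row_counts = np.zeros(n_rows)
    col_counts = np.zeros(n_cols)
    # Arbitrary initial actions so the empirical averages are well defined.
    row_counts[0] += 1
    col_counts[0] += 1
    for _ in range(iters):
        row_avg = row_counts / row_counts.sum()
        col_avg = col_counts / col_counts.sum()
        # Each player best-responds to the opponent's empirical average.
        best_row = np.argmax(payoff @ col_avg)   # row player maximizes
        best_col = np.argmin(row_avg @ payoff)   # column player minimizes
        row_counts[best_row] += 1
        col_counts[best_col] += 1
    return row_counts / row_counts.sum(), col_counts / col_counts.sum()


def exploitability(payoff, row_strategy, col_strategy):
    """Sum of both players' best-response gains; zero exactly at a Nash equilibrium."""
    best_row_value = np.max(payoff @ col_strategy)
    best_col_value = np.min(row_strategy @ payoff)
    return best_row_value - best_col_value


if __name__ == "__main__":
    # Rock-paper-scissors, a hypothetical test game chosen for illustration.
    rps = np.array([[0.0, -1.0, 1.0],
                    [1.0, 0.0, -1.0],
                    [-1.0, 1.0, 0.0]])
    x, y = fictitious_play(rps)
    print("row strategy:", x)
    print("col strategy:", y)
    print("exploitability:", exploitability(rps, x, y))
```

In this toy setting the averaged strategies drift towards the uniform mixture and the exploitability shrinks as iterations grow, which is the discrete analogue of the convergence towards $\epsilon$-Nash equilibria that the abstract reports empirically for continuous-space games.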

