GAE Falls Short in Imperfect-Information Self-Play Reinforcement Learning
For researchers working on multi-agent reinforcement learning in imperfect-information games, this work targets a variance bottleneck in GAE under self-play and offers a practical remedy.
The paper identifies that Generalized Advantage Estimation (GAE) introduces additional variance in imperfect-information self-play reinforcement learning due to stochastic future action sampling, which is amplified in equilibrium self-play. The authors propose Variance-Reduced Policy Optimization (VRPO) with a Q-boosting advantage estimator, achieving strong performance in games like Dou Dizhu and Heads-Up No-Limit Texas Hold'em.
Competitive multi-agent reinforcement learning in imperfect-information games requires agents to act under partial observability and against adversarial opponents, which necessitates stochastic policies. While self-play reinforcement learning with Proximal Policy Optimization (PPO) has achieved strong empirical success, its standard advantage estimator, Generalized Advantage Estimation (GAE), suffers from additional variance due to the sampling of stochastic future actions. This variance is amplified in equilibrium self-play because the equilibrium policy is itself stochastic, and it persists even when the critic is exact. We address this bottleneck by introducing $Q$-boosting, a variance-reduced advantage estimator based on a centralized action-value critic, and propose Variance-Reduced Policy Optimization (VRPO), which incorporates this estimator. The algorithm replaces sampled multi-step backups with a multi-step Expected SARSA$(λ)$ trace, computing policy expectations at each step to average out action-sampling noise, while retaining PPO's clipped objective and on-policy actor updates. Empirically, VRPO consistently achieves strong performance across games ranging from mid-sized to large-scale, including Dou Dizhu and Heads-Up No-Limit Texas Hold'em.
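To make the idea of an Expected SARSA$(λ)$ trace concrete, here is a minimal NumPy sketch of the textbook on-policy recursion for a single finished episode. It is not the authors' $Q$-boosting estimator, whose exact form the abstract does not give. The function name `expected_sarsa_lambda_advantages`, the array layout, the single-trajectory setting, and the choice of $\sum_a \pi(a \mid s_t) Q(s_t, a)$ as the advantage baseline are all assumptions made for illustration.

```python
import numpy as np


def expected_sarsa_lambda_advantages(rewards, q, pi, actions, gamma=1.0, lam=0.95):
    """Backward-recursive Expected SARSA(lambda) targets for one finished episode.

    Illustrative sketch only (hypothetical helper, not taken from the paper).

    rewards: (T,)   rewards r_t
    q:       (T, A) critic estimates Q(s_t, a) for every action
    pi:      (T, A) policy probabilities pi(a | s_t)
    actions: (T,)   indices of the actions actually taken
    """
    rewards = np.asarray(rewards, dtype=float)
    q = np.asarray(q, dtype=float)
    pi = np.asarray(pi, dtype=float)
    actions = np.asarray(actions, dtype=int)
    T = len(rewards)

    # Policy expectation of the critic: E_{a ~ pi}[Q(s_t, a)].
    v_bar = (pi * q).sum(axis=1)
    # Critic value of the action that was actually sampled.
    q_taken = q[np.arange(T), actions]

    targets = np.zeros(T)
    # Everything past the terminal step is treated as zero.
    next_target = next_v_bar = next_q_taken = 0.0
    for t in reversed(range(T)):
        # Bootstrap from the expected value; the sampled continuation enters
        # only through the lambda-weighted correction (target - Q of taken action).
        targets[t] = rewards[t] + gamma * (
            next_v_bar + lam * (next_target - next_q_taken)
        )
        next_target, next_v_bar, next_q_taken = targets[t], v_bar[t], q_taken[t]

    # Assumed baseline: the policy-averaged critic value at s_t.
    advantages = targets - v_bar
    return advantages, targets
```

With `lam=0` each target reduces to `r_t + gamma * v_bar[t+1]`, which does not depend on which future action was sampled. With `lam=1` and a deterministic policy, where `v_bar` equals `q_taken`, the targets collapse to the Monte Carlo return. Intermediate values of `lam` trade off between these two regimes.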