LG · AI · May 21

Self-Play Reinforcement Learning under Imperfect Information in Big 2

arXiv:2605.28863 · 4.6 · h-index: 2
Predicted impact top 93% in LG · last 90 days · Originality: Synthesis-oriented
AI Analysis

Provides a controlled benchmark for studying deep RL in imperfect-information multiplayer games with delayed rewards and variable action sets.

The paper develops a self-play RL framework for the imperfect-information multiplayer card game Big 2 and shows that PPO outperforms other RL methods (Monte Carlo Q, SARSA, Q-learning) against various opponents. Moderate entropy regularization and current-policy self-play further improve performance.

Imperfect-information multiplayer games test whether agents can act under hidden information, sparse rewards, and non-stationary opponents. We study these challenges in Big 2, a four-player imperfect-information card game. We develop a self-play RL framework for Big 2 that enables controlled comparisons between policy-gradient and value-approximating agents. Under a common environment, input representation, training budget, and evaluation protocol, PPO outperforms Monte Carlo Q approximation, SARSA, and Q-learning against random, greedy, and heuristic Big 2 opponents. We further find that moderate entropy regularization improves PPO by preventing the policy from becoming overly deterministic, and that current-policy self-play provides a stronger finite-budget curriculum than checkpoint self-play or fixed-opponent training. Together, these results show that Big 2 is a useful controlled setting for studying deep RL under imperfect information, multiplayer interaction, delayed rewards, and variable action sets.
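Two ideas in the abstract, variable action sets and entropy-regularized PPO, can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the boolean legality mask, and the default coefficients (clip_eps=0.2, ent_coef=0.01) are assumptions chosen for illustration. The sketch restricts a softmax policy to the currently legal moves and adds an entropy bonus to the standard PPO clipped surrogate for a single sample.

```python
import math

def masked_softmax(logits, legal):
    """Softmax over legal actions only; illegal actions get probability 0.

    `legal` is a hypothetical boolean mask, one entry per action slot.
    """
    if not any(legal):
        raise ValueError("at least one action must be legal")
    # Subtract the max legal logit for numerical stability.
    m = max(l for l, ok in zip(logits, legal) if ok)
    exps = [math.exp(l - m) if ok else 0.0 for l, ok in zip(logits, legal)]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability actions contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def ppo_objective(new_prob, old_prob, advantage, probs,
                  clip_eps=0.2, ent_coef=0.01):
    """Per-sample PPO clipped surrogate plus an entropy bonus (to maximize).

    new_prob / old_prob: probability of the taken action under the new
    and old policies. probs: the new policy's full masked distribution.
    """
    ratio = new_prob / old_prob
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    surrogate = min(ratio * advantage, clipped * advantage)
    return surrogate + ent_coef * entropy(probs)

# Usage: four action slots, of which only slots 0 and 2 are legal this turn.
probs = masked_softmax([1.0, 2.0, 3.0, 0.5], [True, False, True, False])
objective = ppo_objective(new_prob=probs[2], old_prob=0.5,
                          advantage=1.0, probs=probs)
```

A larger ent_coef rewards policies that keep probability spread across legal moves, which is the mechanism the abstract credits with preventing an overly deterministic policy; the value that counts as "moderate" is an empirical question the sketch does not answer.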

