LG · AI · May 21

Self-Play Reinforcement Learning under Imperfect Information in Big 2

arXiv:2605.28863 · 4.6 · h-index: 2

Predicted impact: top 93% in LG (last 90 days) · Originality: Synthesis-oriented

AI Analysis

Provides a controlled benchmark for studying deep RL in imperfect-information multiplayer games with delayed rewards and variable action sets.

The paper develops a self-play RL framework for the imperfect-information multiplayer card game Big 2 and shows that PPO outperforms other RL methods (Monte Carlo Q, SARSA, Q-learning) against various opponents. Moderate entropy regularization and current-policy self-play further improve performance.
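To make the entropy-regularization point concrete, here is a minimal sketch of a single-sample PPO clipped loss with an entropy bonus, computed over legal actions only. This is an illustration, not the paper's implementation: the function names, the masking approach to variable action sets, and the default values of `clip_eps` and `ent_coef` are all assumptions.

```python
import math

def masked_softmax(logits, legal):
    """Softmax restricted to legal actions; illegal actions get probability 0.
    Masking is an assumed way to handle Big 2's variable action sets."""
    peak = max(l for l, ok in zip(logits, legal) if ok)  # for numerical stability
    exps = [math.exp(l - peak) if ok else 0.0 for l, ok in zip(logits, legal)]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability actions contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def ppo_loss(logits, legal, action, old_prob, advantage,
             clip_eps=0.2, ent_coef=0.01):
    """Single-sample PPO clipped surrogate with an entropy bonus (to minimize).
    clip_eps and ent_coef are illustrative defaults, not values from the paper."""
    probs = masked_softmax(logits, legal)
    ratio = probs[action] / old_prob
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    surrogate = min(ratio * advantage, clipped * advantage)
    # A larger ent_coef rewards spread-out policies, discouraging early determinism.
    return -(surrogate + ent_coef * entropy(probs))
```

Setting `ent_coef` to zero recovers the plain clipped objective, while a very large value pushes the policy toward uniform play over the legal actions. A "moderate" setting sits between those two extremes.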

Imperfect-information multiplayer games test whether agents can act under hidden information, sparse rewards, and non-stationary opponents. We study these challenges in Big 2, a four-player imperfect-information card game. We develop a self-play RL framework for Big 2 that enables controlled comparisons between policy-gradient and value-approximating agents. Under a common environment, input representation, training budget, and evaluation protocol, PPO outperforms Monte Carlo Q approximation, SARSA, and Q-learning against random, greedy, and heuristic Big 2 opponents. We further find that moderate entropy regularization improves PPO by preventing the policy from becoming overly deterministic, and that current-policy self-play provides a stronger finite-budget curriculum than checkpoint self-play or fixed-opponent training. Together, these results show that Big 2 is a useful controlled setting for studying deep RL under imperfect information, multiplayer interaction, delayed rewards, and variable action sets.
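The three training regimes compared in the abstract differ mainly in who fills the other three seats at the table. The sketch below shows one way that choice could be expressed. The function name, the mode strings, and uniform sampling from the checkpoint pool are hypothetical; the abstract does not specify how checkpoints are selected.

```python
import random

def pick_opponents(mode, current, checkpoints, fixed, n_opponents=3, rng=random):
    """Choose the opponents for one four-player Big 2 training game.

    mode: "current"    -> every opponent is the learner's current policy
          "checkpoint" -> opponents are drawn from saved past policies
          "fixed"      -> opponents are a fixed scripted policy (e.g. greedy)
    Policies are opaque objects here; any placeholder value works.
    """
    if mode == "current":
        return [current] * n_opponents
    if mode == "checkpoint":
        # Assumption: uniform sampling; fall back to the current policy
        # before any checkpoint has been saved.
        pool = checkpoints or [current]
        return [rng.choice(pool) for _ in range(n_opponents)]
    if mode == "fixed":
        return [fixed] * n_opponents
    raise ValueError(f"unknown self-play mode: {mode!r}")
```

For example, `pick_opponents("current", "pi_t", [], "greedy")` seats three copies of the current policy. Under this scheme, opponent strength tracks the learner's own strength at every step.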
