AlphaExploitem: Going Beyond the Nash Equilibrium in Poker by Learning to Exploit Suboptimal Play
This work gives AI agents for imperfect-information games a method to go beyond the Nash equilibrium by learning to exploit suboptimal play, a problem of practical importance for real-world deployment.
AlphaExploitem extends AlphaHoldem to learn policies that exploit suboptimal opponents in poker, achieving higher utility against weak players while maintaining Nash equilibrium performance against optimal opponents.
Poker is an imperfect-information game that has served as a long-standing benchmark for decision-making under uncertainty. To obtain utility beyond what the Nash equilibrium (NE) yields, an agent can deviate from NE policies to exploit suboptimal play. We introduce AlphaExploitem, which extends the competitive RL poker agent AlphaHoldem in two ways: a hierarchical transformer encoder that enables reasoning over previously played hands, and a modified training procedure that includes a diverse pool of exploitable opponents to facilitate learning to exploit. We train and evaluate AlphaExploitem on two standard benchmarks for imperfect-information games. Empirically, AlphaExploitem successfully exploits weak play by both in-distribution and out-of-distribution opponents, without losing performance against NE opponents.
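The gap between equilibrium play and exploitative play can be illustrated with a minimal sketch on a toy game. It uses rock-paper-scissors rather than poker, and it is not AlphaExploitem's method: the payoff matrix, the `biased_opponent` distribution, and the helper names are all assumptions chosen for illustration. The uniform NE policy earns zero expected utility against any opponent, whereas a best response to a biased opponent earns strictly more, at the cost of being exploitable itself.

```python
# Toy illustration only: rock-paper-scissors, not poker and not AlphaExploitem.
# PAYOFF[i][j] is the row player's payoff when it plays i and the opponent plays j.
# Action order: 0 = rock, 1 = paper, 2 = scissors.
PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]


def expected_utility(policy, opponent):
    """Expected payoff of a mixed policy against a mixed opponent policy."""
    return sum(
        policy[i] * opponent[j] * PAYOFF[i][j]
        for i in range(3)
        for j in range(3)
    )


def best_response(opponent):
    """Pure policy that maximizes expected payoff against a known opponent."""
    action_values = [
        sum(opponent[j] * PAYOFF[i][j] for j in range(3)) for i in range(3)
    ]
    best = max(range(3), key=lambda i: action_values[i])
    return [1.0 if i == best else 0.0 for i in range(3)]


nash = [1 / 3, 1 / 3, 1 / 3]          # NE policy of rock-paper-scissors
biased_opponent = [0.5, 0.25, 0.25]   # hypothetical weak opponent that over-plays rock

nash_value = expected_utility(nash, biased_opponent)
exploit_policy = best_response(biased_opponent)
exploit_value = expected_utility(exploit_policy, biased_opponent)

# Worst case of the exploiting policy against any pure opponent action.
exploit_worst_case = min(
    expected_utility(exploit_policy, [1.0 if j == k else 0.0 for j in range(3)])
    for k in range(3)
)

print(f"NE policy vs biased opponent:      {nash_value:+.2f}")
print(f"Best response vs biased opponent:  {exploit_value:+.2f}")
print(f"Best response, worst case:         {exploit_worst_case:+.2f}")
```

In this sketch the best response is computed from a known opponent distribution. In the setting the abstract describes, the opponent's tendencies are not given and would instead have to be inferred from previously played hands.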