AI · MA · May 5

Agent Island: A Saturation- and Contamination-Resistant Benchmark from Multiagent Games

arXiv:2605.04312 · 28.1 · h-index: 1

Predicted impact: top 89% in AI · last 90 days · Originality: Incremental advance

AI Analysis

For AI capability tracking, this benchmark addresses saturation and contamination by evaluating models in a dynamic, adversarial environment.

The paper introduces Agent Island, a dynamic multiagent game benchmark designed to resist saturation and contamination. Across 999 games involving 49 models, openai/gpt-5.5 achieves a posterior mean skill of 5.64, well ahead of the second-ranked openai/gpt-5.2 at 3.10. The released game logs also show that models are 8.3 percentage points more likely to support a same-provider finalist in final-round votes.

Static capabilities benchmarks suffer from saturation and contamination, making it difficult to track capabilities progress over time. We introduce Agent Island, a multiplayer simulation environment in which language-model agents compete in a game of interagent cooperation, conflict, and persuasion. The environment yields a dynamic benchmark designed to mitigate both saturation and contamination; new models can always outperform the current leading player in this winner-take-all game, and agents compete against other adaptive agents rather than face a fixed task set. We rank players with a Bayesian Plackett-Luce model, allowing us to quantify uncertainty in player skill. In 999 games involving 49 unique models, openai/gpt-5.5 dominates its peers with a posterior mean skill of 5.64, compared with 3.10 for the second-ranked model, openai/gpt-5.2, and 2.86 for the third-ranked model, openai/gpt-5.3-codex. We release the game logs as a dataset for analyses of model behavior. As an example, we investigate same-provider preference in final-round votes and find that models are 8.3 p.p. more likely to support a same-provider finalist than finalists from other providers. This preference is not uniform across providers: among separately estimated providers, the effect is strongest for OpenAI models and weakest for Anthropic models.
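The ranking step in the abstract can be illustrated with a minimal Plackett-Luce sketch. This is not the authors' implementation: the abstract does not state the prior, the inference method, or how finishing orders are extracted from games, so the Normal(0, prior_sd^2) prior, the gradient-ascent MAP estimate (a point estimate rather than a full posterior), and the player names model_a, model_b, and model_c are all assumptions made for illustration.

```python
import math


def pl_log_likelihood(skills, ranking):
    """Log-probability of one finishing order under Plackett-Luce.

    skills: dict mapping player name -> real-valued log-skill.
    ranking: list of player names, best finisher first.
    """
    total = 0.0
    for i in range(len(ranking) - 1):
        remaining = [skills[p] for p in ranking[i:]]
        # Log-sum-exp over players not yet placed, stabilised by the max.
        m = max(remaining)
        log_norm = m + math.log(sum(math.exp(s - m) for s in remaining))
        total += skills[ranking[i]] - log_norm
    return total


def map_skills(games, players, prior_sd=2.0, lr=0.05, steps=2000):
    """MAP log-skills by gradient ascent.

    Assumes independent Normal(0, prior_sd^2) priors on each log-skill;
    this prior is an illustrative choice, not taken from the paper.
    """
    skills = {p: 0.0 for p in players}
    for _ in range(steps):
        # Gradient of the log-prior.
        grad = {p: -skills[p] / prior_sd**2 for p in players}
        for ranking in games:
            for i in range(len(ranking) - 1):
                rest = ranking[i:]
                m = max(skills[p] for p in rest)
                weights = {p: math.exp(skills[p] - m) for p in rest}
                z = sum(weights.values())
                # d/ds of [s_chosen - log sum exp(s_rest)].
                grad[ranking[i]] += 1.0
                for p in rest:
                    grad[p] -= weights[p] / z
        for p in players:
            skills[p] += lr * grad[p]
    return skills


# Hypothetical finishing orders, invented purely for illustration.
players = ["model_a", "model_b", "model_c"]
games = [["model_a", "model_b", "model_c"]] * 4 + [["model_b", "model_a", "model_c"]]
fit = map_skills(games, players)
for name in sorted(fit, key=fit.get, reverse=True):
    print(name, round(fit[name], 2))
```

In this toy data model_a finishes first in four of five games, so its estimated log-skill ends up highest. A fully Bayesian treatment, as the abstract describes, would report a posterior distribution per player rather than a single point estimate, which is what allows uncertainty in skill to be quantified.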
