AI · MA · May 5

Agent Island: A Saturation- and Contamination-Resistant Benchmark from Multiagent Games

arXiv:2605.04312 · 28.1 · h-index: 1
Predicted impact: top 89% in AI · last 90 days · Originality: Incremental advance
AI Analysis

For AI capability tracking, this benchmark addresses the critical issues of benchmark saturation and contamination by providing a dynamic, adversarial evaluation environment.

The paper introduces Agent Island, a dynamic multiagent game benchmark designed to resist saturation and contamination. Across 999 games involving 49 models, openai/gpt-5.5 achieves a posterior mean skill of 5.64, well ahead of the second-ranked openai/gpt-5.2 at 3.10. The paper also finds that, in final-round votes, models are 8.3 percentage points more likely to support a same-provider finalist than finalists from other providers.

Static capabilities benchmarks suffer from saturation and contamination, making it difficult to track capabilities progress over time. We introduce Agent Island, a multiplayer simulation environment in which language-model agents compete in a game of interagent cooperation, conflict, and persuasion. The environment yields a dynamic benchmark designed to mitigate both saturation and contamination; new models can always outperform the current leading player in this winner-take-all game, and agents compete against other adaptive agents rather than face a fixed task set. We rank players with a Bayesian Plackett-Luce model, allowing us to quantify uncertainty in player skill. In 999 games involving 49 unique models, openai/gpt-5.5 dominates its peers with a posterior mean skill of 5.64, compared with 3.10 for the second-ranked model, openai/gpt-5.2, and 2.86 for the third-ranked model, openai/gpt-5.3-codex. We release the game logs as a dataset for analyses of model behavior. As an example, we investigate same-provider preference in final-round votes and find that models are 8.3 p.p. more likely to support a same-provider finalist than finalists from other providers. This preference is not uniform across providers: among separately estimated providers, the effect is strongest for OpenAI models and weakest for Anthropic models.
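
To make the ranking model mentioned above more concrete, here is a minimal sketch of the Plackett-Luce likelihood for a single game's finishing order. It is an illustration under stated assumptions, not the paper's implementation: the function names, the toy skill values, and the choice to treat skills as log-scale parameters are all hypothetical. A Bayesian version would additionally place a prior on the skills and infer a posterior, which this sketch does not do.

```python
import math


def _log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def plackett_luce_log_likelihood(skills, ranking):
    """Log-probability of one finishing order under Plackett-Luce.

    skills:  dict of player name -> log-scale skill (hypothetical values).
    ranking: list of player names, best finisher first.
    """
    remaining = list(ranking)
    log_lik = 0.0
    # At each stage the next finisher is chosen from those still remaining,
    # with probability proportional to exp(skill).
    for player in ranking[:-1]:  # the last place is forced, probability 1
        log_lik += skills[player] - _log_sum_exp([skills[p] for p in remaining])
        remaining.remove(player)
    return log_lik


def win_probability(skills, player, field):
    """Probability that `player` finishes first among the players in `field`."""
    return math.exp(skills[player] - _log_sum_exp([skills[p] for p in field]))


# Toy values for illustration only; these are not results from the paper.
toy_skills = {"model_a": 2.0, "model_b": 1.0, "model_c": 0.0}
order = ["model_a", "model_b", "model_c"]
print(plackett_luce_log_likelihood(toy_skills, order))
print(win_probability(toy_skills, "model_a", list(toy_skills)))
```

Because only the first-place term matters in a winner-take-all game, `win_probability` is the quantity most directly tied to who wins; the full ranking likelihood is useful when elimination order is also observed.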

