BotzoneBench: Scalable LLM Evaluation via Graded AI Anchors
BotzoneBench provides a scalable, reusable framework for assessing interactive AI capabilities, addressing a bottleneck in LLM evaluation for researchers and developers. Its contribution is incremental, however, in that it applies existing game AI hierarchies to a new context.
The paper tackles the challenge of systematically evaluating LLMs in interactive strategic environments by introducing BotzoneBench, a scalable framework that anchors evaluation to fixed hierarchies of skill-calibrated game AI, enabling linear-time absolute skill measurement. The results show significant performance disparities among five flagship models, with top models achieving proficiency comparable to mid-to-high-tier specialized game AI in multiple domains. The evaluation spans eight diverse games and 177,047 state-action pairs.
Large Language Models (LLMs) are increasingly deployed in interactive environments requiring strategic decision-making, yet systematic evaluation of these capabilities remains challenging. Existing benchmarks for LLMs primarily assess static reasoning through isolated tasks and fail to capture dynamic strategic abilities. Recent game-based evaluations employ LLM-vs-LLM tournaments that produce relative rankings dependent on transient model pools, incurring quadratic computational costs and lacking stable performance anchors for longitudinal tracking. The central challenge is establishing a scalable evaluation framework that measures LLM strategic reasoning against consistent, interpretable standards rather than volatile peer models. Here we show that anchoring LLM evaluation to fixed hierarchies of skill-calibrated game Artificial Intelligence (AI) enables linear-time absolute skill measurement with stable cross-temporal interpretability. Built on the Botzone platform's established competitive infrastructure, our BotzoneBench evaluates LLMs across eight diverse games spanning deterministic perfect-information board games to stochastic imperfect-information card games. Through systematic assessment of 177,047 state-action pairs from five flagship models, we reveal significant performance disparities and identify distinct strategic behaviors, with top-performing models achieving proficiency comparable to mid-to-high-tier specialized game AI in multiple domains. This anchored evaluation paradigm generalizes beyond games to any domain with well-defined skill hierarchies, establishing a scalable and reusable framework for assessing interactive AI capabilities.
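The cost argument in the abstract (quadratic for LLM-vs-LLM tournaments, linear for anchored evaluation) can be illustrated with a short count of required matchups. The Python sketch below is illustrative only and is not taken from the paper. The function names, the number of anchor tiers, and the 0.5 win-rate threshold used to place a model on the anchor ladder are all hypothetical assumptions.

```python
def round_robin_matchups(num_models: int) -> int:
    """Pairings in an all-vs-all tournament: n * (n - 1) / 2, quadratic in n."""
    return num_models * (num_models - 1) // 2


def anchored_matchups(num_models: int, num_anchors: int) -> int:
    """Pairings when each model only faces a fixed set of anchor AIs: linear in n."""
    return num_models * num_anchors


def highest_anchor_matched(win_rates: list[float], threshold: float = 0.5) -> int:
    """Hypothetical placement rule (an assumption, not the paper's metric).

    win_rates[i] is the model's win rate against anchor tier i + 1, with
    tiers ordered from weakest to strongest. Returns the 1-based index of
    the strongest tier the model at least matches, or 0 if it matches none.
    """
    level = 0
    for tier, rate in enumerate(win_rates, start=1):
        if rate >= threshold:
            level = tier
    return level


if __name__ == "__main__":
    # Assumed pool sizes and an assumed ladder of 4 anchor tiers.
    for n in (5, 50, 500):
        print(n, round_robin_matchups(n), anchored_matchups(n, num_anchors=4))
    print(highest_anchor_matched([0.9, 0.7, 0.4, 0.1]))
```

For small pools the anchored scheme is not necessarily cheaper; the advantage appears as the pool grows. Adding one model to a round-robin pool requires a pairing against every existing model, while the anchored scheme requires only one pairing per anchor tier, and previously measured models do not need to be re-evaluated because the anchors are fixed.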