AI · Jan 22

BotzoneBench: Scalable LLM Evaluation via Graded AI Anchors

Lingfeng Li, Yunlong Lu, Yuefei Zhang, Jingyu Yao, Yixin Zhu, KeYuan Cheng, Yongyi Wang, Qirui Zheng, Xionghui Yang, Wenxin Li

arXiv:2602.13214v16.02 citations · h-index: 12

Originality: Incremental advance

AI Analysis

BotzoneBench provides a scalable and reusable framework for assessing interactive AI capabilities, addressing a bottleneck in LLM evaluation for researchers and developers, though its contribution is incremental: it applies existing game AI hierarchies to a new context.

The paper tackles the challenge of systematically evaluating LLMs in interactive strategic environments by introducing BotzoneBench, a scalable framework that anchors evaluation to fixed hierarchies of skill-calibrated game AI, enabling linear-time absolute skill measurement. The results show significant performance disparities among five flagship models, with top models achieving proficiency comparable to mid-to-high-tier specialized game AI in multiple domains. The evaluation covers eight diverse games and is based on an assessment of 177,047 state-action pairs.

Large Language Models (LLMs) are increasingly deployed in interactive environments requiring strategic decision-making, yet systematic evaluation of these capabilities remains challenging. Existing benchmarks for LLMs primarily assess static reasoning through isolated tasks and fail to capture dynamic strategic abilities. Recent game-based evaluations employ LLM-vs-LLM tournaments that produce relative rankings dependent on transient model pools, incurring quadratic computational costs and lacking stable performance anchors for longitudinal tracking. The central challenge is establishing a scalable evaluation framework that measures LLM strategic reasoning against consistent, interpretable standards rather than volatile peer models. Here we show that anchoring LLM evaluation to fixed hierarchies of skill-calibrated game Artificial Intelligence (AI) enables linear-time absolute skill measurement with stable cross-temporal interpretability. Built on the Botzone platform's established competitive infrastructure, our BotzoneBench evaluates LLMs across eight diverse games spanning deterministic perfect-information board games to stochastic imperfect-information card games. Through systematic assessment of 177,047 state-action pairs from five flagship models, we reveal significant performance disparities and identify distinct strategic behaviors, with top-performing models achieving proficiency comparable to mid-to-high-tier specialized game AI in multiple domains. This anchored evaluation paradigm generalizes beyond games to any domain with well-defined skill hierarchies, establishing a scalable and reusable framework for assessing interactive AI capabilities.
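The cost argument in the abstract can be made concrete with a small counting sketch. It is illustrative only: the function names, the anchor count of 4, and the 0.5 win-rate threshold are assumptions made here, and the abstract does not describe the paper's actual scoring procedure. The sketch compares the number of pairings needed for an LLM-vs-LLM round-robin with the number needed when each model plays a fixed ladder of anchor AIs, then shows one hypothetical way to read an absolute tier off per-anchor win rates.

```python
from typing import Sequence


def round_robin_pairings(num_models: int) -> int:
    """Distinct model pairs in an LLM-vs-LLM round-robin: n * (n - 1) / 2."""
    return num_models * (num_models - 1) // 2


def anchored_pairings(num_models: int, num_anchors: int) -> int:
    """Model-vs-anchor pairings when every model plays the same fixed anchor ladder."""
    return num_models * num_anchors


def highest_tier_matched(win_rates: Sequence[float], threshold: float = 0.5) -> int:
    """Return the 1-based index of the strongest anchor tier the model at least matches.

    win_rates[i] is the model's win rate against anchor tier i + 1, with tiers
    ordered from weakest to strongest. The threshold is a hypothetical choice.
    Returns 0 if the model reaches the threshold against no tier.
    """
    matched = 0
    for tier, rate in enumerate(win_rates, start=1):
        if rate >= threshold:
            matched = tier
    return matched


if __name__ == "__main__":
    # Assumed ladder of 4 anchor tiers; pairings grow quadratically vs linearly.
    for n in (5, 10, 20):
        print(n, round_robin_pairings(n), anchored_pairings(n, num_anchors=4))
    # Hypothetical win rates against tiers 1..4 (weakest to strongest).
    print(highest_tier_matched([0.9, 0.6, 0.3, 0.1]))
```

Under these assumptions, adding a new model to an anchored evaluation costs a fixed number of additional pairings, and previously measured models do not need to be re-run, whereas a round-robin must pair the newcomer with every existing model.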

