CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V
CivBench gives researchers and developers working on LLM agents in complex, multi-agent environments a more nuanced evaluation tool, though the contribution is incremental because it builds on existing benchmarking approaches.
The paper addresses the problem of evaluating strategic decision-making in LLM-based agents by introducing CivBench, a Civilization V benchmark that estimates victory probabilities throughout gameplay and so provides richer evaluation signals than sparse win/loss outcomes. Across 307 games with 7 LLMs, the results demonstrate its potential to assess strategic capabilities, revealing model-specific effects and distinct strategic profiles.
Evaluating strategic decision-making in LLM-based agents requires generative, competitive, and longitudinal environments, yet few benchmarks provide all three, and fewer still offer evaluation signals rich enough for long-horizon, multi-agent play. We introduce CivBench, a benchmark for LLM strategists (i.e., agentic setups) in multiplayer Civilization V. Because terminal win/loss is too sparse a signal in games spanning hundreds of turns and multiple opponents, CivBench trains models on turn-level game state to estimate victory probabilities throughout play, validated through predictive, construct, and convergent validity. Across 307 games with 7 LLMs and multiple CivBench agent conditions, we demonstrate CivBench's potential to estimate strategic capabilities as an unsaturated benchmark, reveal model-specific effects of agentic setup, and outline distinct strategic profiles not visible through outcome-only evaluation.
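The idea of estimating victory probabilities from turn-level game state can be illustrated with a small sketch. The abstract does not say which model family or features CivBench uses, so everything below is an assumption made for illustration: the data is synthetic, the game has two players, the `lead` feature and the `features` helper are hypothetical, and plain logistic regression stands in for whatever estimator the authors trained. The per-player average of the estimated probabilities is one possible progress-based score, not necessarily the paper's.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def features(lead, turn_fraction):
    # Hypothetical turn-level state: a score lead, plus the lead weighted by
    # how far the game has progressed (a late lead should matter more).
    return [lead, lead * turn_fraction]


def simulate_game(rng, turns=30):
    """Synthetic two-player game: per-turn leads of player 0 and the winner."""
    lead, leads = 0.0, []
    for _ in range(turns):
        lead += rng.gauss(0.0, 0.2)  # player 0's lead follows a random walk
        leads.append(lead)
    return leads, (0 if lead > 0 else 1)


def build_dataset(rng, n_games=20, turns=30):
    """One labeled row per player per turn; label 1 means that player won."""
    rows, labels = [], []
    for _ in range(n_games):
        leads, winner = simulate_game(rng, turns)
        for t, lead in enumerate(leads, start=1):
            frac = t / turns
            rows.append(features(lead, frac))
            labels.append(1 if winner == 0 else 0)
            rows.append(features(-lead, frac))
            labels.append(1 if winner == 1 else 0)
    return rows, labels


def fit_logistic(rows, labels, lr=0.5, epochs=200):
    """Batch gradient descent on log loss; the bias is the last weight."""
    n_features = len(rows[0])
    w = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for x, y in zip(rows, labels):
            err = win_probability(w, x) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
            grad[-1] += err
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w


def win_probability(w, x):
    """Estimated probability that the player described by x wins."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + w[-1])


rng = random.Random(0)
rows, labels = build_dataset(rng)
w = fit_logistic(rows, labels)

# Score a held-out game turn by turn instead of only by its final outcome.
held_out_leads, held_out_winner = simulate_game(rng)
trajectory = [
    win_probability(w, features(lead, t / len(held_out_leads)))
    for t, lead in enumerate(held_out_leads, start=1)
]
progress_score = sum(trajectory) / len(trajectory)
print(f"player 0 progress score: {progress_score:.3f}, won: {held_out_winner == 0}")
```

The point of the sketch is the shape of the signal: a single game yields one win/loss label but a full trajectory of probability estimates, so two agents that both lose can still be distinguished by how close they came throughout play.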