LM Fight Arena: Benchmarking Large Multimodal Models via Game Competition
LM Fight Arena gives AI researchers a benchmark for testing LMMs in dynamic, adversarial settings and connects model evaluation with interactive entertainment. The contribution is incremental, since it adapts an existing game environment for benchmarking.
The paper addresses the problem of evaluating large multimodal models (LMMs) in real-time, adversarial environments. It introduces LM Fight Arena, a framework that pits models against each other in the fighting game Mortal Kombat II and yields a fully automated, objective assessment of their strategic reasoning capabilities.
Existing benchmarks for large multimodal models (LMMs) often fail to capture their performance in real-time, adversarial environments. We introduce LM Fight Arena (Large Model Fight Arena), a novel framework that evaluates LMMs by pitting them against each other in the classic fighting game Mortal Kombat II, a task requiring rapid visual understanding and tactical, sequential decision-making. In a controlled tournament, we test six leading open- and closed-source models, with each agent controlling the same character to ensure a fair comparison. The models are prompted to interpret game frames and state data to select their next actions. Unlike static evaluations, LM Fight Arena provides a fully automated, reproducible, and objective assessment of an LMM's strategic reasoning capabilities in a dynamic setting. This work introduces a challenging and engaging benchmark that bridges the gap between AI evaluation and interactive entertainment.
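The decision step and tournament described in the abstract could be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the action vocabulary, the state field names, the `block` fallback, and the round-robin pairing are all hypothetical, since the abstract specifies neither the action set nor the tournament format. The `play_match` callback stands in for running an actual game between two models.

```python
import itertools
from collections import Counter

# Hypothetical action vocabulary; the actual action set is not given in the abstract.
ACTIONS = ["left", "right", "jump", "crouch", "block", "punch", "kick"]


def build_prompt(state):
    """Format game-state data as text to accompany the current frame."""
    fields = ", ".join(f"{key}={value}" for key, value in sorted(state.items()))
    return f"Game state: {fields}. Choose one action from: {', '.join(ACTIONS)}."


def parse_action(reply, default="block"):
    """Return the first known action named in a model reply, else a fallback."""
    words = reply.lower().replace(",", " ").replace(".", " ").split()
    for word in words:
        if word in ACTIONS:
            return word
    return default  # assumed fallback when the reply names no valid action


def round_robin(models):
    """Pair every model with every other model once (assumed format)."""
    return list(itertools.combinations(models, 2))


def run_tournament(models, play_match, matches_per_pair=1):
    """Tally wins, where play_match(a, b) returns the name of the winner."""
    wins = Counter({model: 0 for model in models})
    for model_a, model_b in round_robin(models):
        for _ in range(matches_per_pair):
            wins[play_match(model_a, model_b)] += 1
    return wins


if __name__ == "__main__":
    print(build_prompt({"p1_health": 80, "p2_health": 55}))
    print(parse_action("I will Kick."))
    # Placeholder match function: the first-listed model always wins.
    print(run_tournament(["model_a", "model_b", "model_c"], lambda a, b: a))
```

With six models, this assumed round-robin format gives 15 pairings. Having every agent control the same character, as the abstract states, means that differences in win counts reflect the models' decisions and not differences between characters.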