OmniPlay: Benchmarking Omni-Modal Models on Omni-Modal Game Playing
This work addresses the evaluation gap for omni-modal AI models in interactive settings, a gap that matters for progress toward robust AGI. The contribution is incremental, however, because it builds on existing benchmark methodologies.
The researchers tackled the problem of evaluating omni-modal models in dynamic, interactive environments by introducing OmniPlay, a diagnostic benchmark with five game environments that test cross-modal reasoning. Their evaluation of six leading models revealed a critical dichotomy: superhuman performance on memory tasks but systemic failures in reasoning and planning, with performance degrading under modality conflict.
While generalist foundation models like Gemini and GPT-4o demonstrate impressive multi-modal competence, existing evaluations fail to test their intelligence in dynamic, interactive worlds. Static benchmarks lack agency, while interactive benchmarks suffer from a severe modal bottleneck, typically ignoring crucial auditory and temporal cues. To bridge this evaluation chasm, we introduce OmniPlay, a diagnostic benchmark designed not just to evaluate, but to probe the fusion and reasoning capabilities of agentic models across the full sensory spectrum. Built on a core philosophy of modality interdependence, OmniPlay comprises a suite of five game environments that systematically create scenarios of both synergy and conflict, forcing agents to perform genuine cross-modal reasoning. Our comprehensive evaluation of six leading omni-modal models reveals a critical dichotomy: they exhibit superhuman performance on high-fidelity memory tasks but suffer from systemic failures in challenges requiring robust reasoning and strategic planning. We demonstrate that this fragility stems from brittle fusion mechanisms, which lead to catastrophic performance degradation under modality conflict. We also uncover a counter-intuitive "less is more" paradox, in which removing sensory information can improve performance. Our findings suggest that the path toward robust AGI requires a research focus beyond scaling to explicitly address synergistic fusion. Our platform is available for anonymous review at https://github.com/fuqingbie/omni-game-benchmark.
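The "less is more" observation suggests a modality-ablation comparison: score an agent with every subset of its input modalities and flag the subsets that beat the full set. The following is a minimal sketch of such a sweep, not the OmniPlay implementation. The modality names, the function names, and the `toy_evaluate` scorer are all hypothetical; the scorer returns synthetic values chosen only to exercise the logic and does not reproduce any result from the paper.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List, Sequence, Tuple

# Assumed modality names; the abstract does not enumerate them.
MODALITIES: Tuple[str, ...] = ("image", "audio", "text")


def ablation_sweep(
    evaluate: Callable[[FrozenSet[str]], float],
    modalities: Sequence[str],
) -> Dict[FrozenSet[str], float]:
    """Score every non-empty subset of the given modalities."""
    scores: Dict[FrozenSet[str], float] = {}
    for size in range(1, len(modalities) + 1):
        for subset in combinations(modalities, size):
            active = frozenset(subset)
            scores[active] = evaluate(active)
    return scores


def less_is_more_cases(
    scores: Dict[FrozenSet[str], float],
    full: FrozenSet[str],
) -> List[Tuple[str, ...]]:
    """Return the proper subsets whose score exceeds the full-modality score."""
    baseline = scores[full]
    return sorted(
        tuple(sorted(subset))
        for subset, score in scores.items()
        if subset != full and score > baseline
    )


def toy_evaluate(active: FrozenSet[str]) -> float:
    """Synthetic placeholder scorer (NOT a real model or real data).

    Each modality adds one point, but image and audio together are
    penalised to imitate a conflict between two input streams.
    """
    score = float(len(active))
    if {"image", "audio"} <= active:
        score -= 2.0
    return score


if __name__ == "__main__":
    results = ablation_sweep(toy_evaluate, MODALITIES)
    for subset in less_is_more_cases(results, frozenset(MODALITIES)):
        print("Outperforms full input:", subset)
```

In a real study, `toy_evaluate` would be replaced by a function that runs the agent in a game environment with only the listed modalities enabled and returns its task score.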