CV | Feb 13, 2025

ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models

Cambridge, Oxford
arXiv:2502.09696v2 | 41 citations | h-index: 37
AI Analysis

This benchmark addresses the need for more challenging evaluations of large multimodal models (LMMs), which currently surpass human-level performance on many popular visual benchmarks but still exhibit major shortcomings in visual reasoning.

The authors introduce ZeroBench, a visual reasoning benchmark that contemporary large multimodal models (LMMs) fail to solve: all 20 evaluated LMMs score 0.0%. This result highlights significant shortfalls of LMMs in spatial cognition and visual understanding.

Large Multimodal Models (LMMs) exhibit major shortfalls when interpreting images and, by some measures, have poorer spatial cognition than small children or animals. Despite this, they attain high scores on many popular visual benchmarks, with headroom rapidly eroded by an ongoing surge of model progress. To address this, there is a pressing need for difficult benchmarks that remain relevant for longer. We take this idea to its limit by introducing ZeroBench, a lightweight visual reasoning benchmark that is entirely impossible for contemporary frontier LMMs. Our benchmark consists of 100 manually curated questions and 334 less difficult subquestions. We evaluate 20 LMMs on ZeroBench, all of which score 0.0%, and rigorously analyse the errors. To encourage progress in visual understanding, we publicly release ZeroBench.

