CV | Feb 13, 2025

ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models

Jonathan Roberts, Mohammad Reza Taesiri, Ansh Sharma, Akash Gupta, Samuel Roberts, Ioana Croitoru, Simion-Vlad Bogolin, Jialu Tang, Florian Langer, Vyas Raina, Vatsal Raina, Hanyi Xiong

Cambridge, Oxford

arXiv:2502.09696v232.344 citations | h-index: 37

Originality: Highly original

AI Analysis

This benchmark addresses the need for more challenging evaluations of LMMs, which currently surpass human-level performance on many popular visual benchmarks but still exhibit major shortcomings in visual reasoning.

The authors introduced ZeroBench, a visual reasoning benchmark that contemporary large multimodal models (LMMs) fail to solve, with all 20 evaluated LMMs scoring 0.0%. This highlights the significant shortfalls of LMMs in spatial cognition and visual understanding.

Large Multimodal Models (LMMs) exhibit major shortfalls when interpreting images and, by some measures, have poorer spatial cognition than small children or animals. Despite this, they attain high scores on many popular visual benchmarks, with headroom rapidly eroded by an ongoing surge of model progress. To address this, there is a pressing need for difficult benchmarks that remain relevant for longer. We take this idea to its limit by introducing ZeroBench, a lightweight visual reasoning benchmark that is entirely impossible for contemporary frontier LMMs. Our benchmark consists of 100 manually curated questions and 334 less difficult subquestions. We evaluate 20 LMMs on ZeroBench, all of which score 0.0%, and rigorously analyse the errors. To encourage progress in visual understanding, we publicly release ZeroBench.
