CV · AI · Apr 17

Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

arXiv:2604.16054 · 77.6 · h-index: 12 · Has Code
Predicted impact: top 32% in CV · last 90 days · Originality: Incremental advance
AI Analysis

For researchers evaluating multimodal LLMs, this benchmark highlights fundamental limitations in visuospatial reasoning compared to humans.

The paper introduces Mind's Eye, a benchmark for evaluating visuospatial reasoning in multimodal LLMs. Humans achieve 80% accuracy, while top models score below 50%, revealing significant gaps in visual abstraction and transformation abilities.

Multimodal large language models (MLLMs) have achieved impressive progress on vision-language benchmarks, yet their capacity for visual cognitive and visuospatial reasoning remains less understood. We introduce "Mind's Eye", a multiple-choice benchmark of eight visuo-cognitive tasks inspired by classic human intelligence tests and organized under a novel "A-R-T" taxonomy: Abstraction, Relation, and Transformation. The tasks probe core processes of fluid intelligence such as pattern induction, analogical relation mapping, and mental transformation. We evaluate a diverse suite of closed-source and open-source MLLMs and compare their performance with that of human participants. Humans achieve 80% accuracy, while top-performing MLLMs remain below 50%. Error analysis reveals failures in (i) visual attention allocation, (ii) internal perceptual manipulation, and (iii) abstraction of underlying visual concepts. Our findings suggest that current MLLMs exhibit limited visuospatial reasoning capabilities compared with human participants, highlighting the need for more cognitively grounded evaluation frameworks.
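
As a rough illustration of how accuracy on a multiple-choice benchmark like this might be broken down by A-R-T category, here is a minimal sketch. The records, the option letters, and the function name accuracy_by_category are hypothetical and not taken from the paper or its code; the "overall" score below is a simple per-item average, which may differ from how the authors aggregate results.

```python
from collections import defaultdict

# Hypothetical records: (category, predicted option, correct option).
# Category names follow the A-R-T taxonomy; the items themselves are made up.
RESPONSES = [
    ("Abstraction", "B", "B"),
    ("Abstraction", "C", "A"),
    ("Relation", "D", "D"),
    ("Relation", "A", "C"),
    ("Transformation", "B", "D"),
    ("Transformation", "A", "A"),
    ("Transformation", "C", "B"),
    ("Transformation", "D", "A"),
]


def accuracy_by_category(responses):
    """Return {category: fraction correct} plus an "overall" per-item average."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, answer in responses:
        total[category] += 1
        correct[category] += int(predicted == answer)
    scores = {c: correct[c] / total[c] for c in total}
    # Assumption: overall accuracy pools all items, so larger categories weigh more.
    scores["overall"] = sum(correct.values()) / sum(total.values())
    return scores


if __name__ == "__main__":
    for name, score in accuracy_by_category(RESPONSES).items():
        print(f"{name}: {score:.1%}")
```

Reporting per-category scores next to the pooled figure makes it easier to see whether a model's gap to human accuracy is concentrated in one kind of task, for example transformation, rather than spread evenly.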

