CV · May 25

Can MLLMs Reason Beyond Language? VisReason: A Comprehensive Benchmark for Vision-Centric Reasoning

arXiv:2605.25364 · 73.9
Predicted impact: top 37% in CV · last 90 days
Originality: Incremental advance
AI Analysis

For researchers evaluating multimodal large language models, this benchmark provides a focused diagnostic to assess reasoning grounded in visual evidence, exposing limitations not captured by existing benchmarks.

VisReason is a benchmark of 1,505 questions across 10 categories for evaluating vision-centric reasoning in everyday scenarios. It reveals substantial gaps between humans and current MLLMs, with limited benefits from test-time reasoning strategies.

Recent multimodal large language models (MLLMs) achieve strong performance on visual reasoning benchmarks, yet it remains unclear to what extent such performance reflects reasoning directly grounded in visual evidence. We introduce VisReason, a benchmark for vision-centric reasoning in everyday scenarios where perception and inference are tightly coupled. VisReason contains 1,505 questions across 10 categories spanning perceptual, structural, and conceptual reasoning. Our evaluation shows that VisReason poses a qualitatively different challenge from existing benchmarks, exposing substantial gaps between humans and current MLLMs and revealing limited benefits from test-time reasoning strategies. VisReason offers a focused diagnostic for evaluating vision-centric reasoning beyond language.
