CV May 29, 2025

Are MLMs Trapped in the Visual Room?

Yazhou Zhang, Chunwang Zou, Qimeng Liu, Lu Rong, Ben Yao, Zheng Lian, Qiuchi Li, Peng Zhang, Jing Qin

arXiv:2505.23272v28.42 citations, h-index: 9

Originality: Incremental advance

AI Analysis

This work addresses how to evaluate genuine understanding in MLMs, giving AI researchers empirical grounding for a philosophical argument and a new evaluation paradigm.

The paper asks whether multimodal large models (MLMs) can genuinely understand images and proposes the Visual Room argument, which holds that they may process visual details without true comprehension. Results show that MLMs achieve high accuracy in visual perception but, even when perception is correct, have an average error rate of ~17.1% in sarcasm understanding, revealing a significant gap between seeing and understanding.

Can multi-modal large models (MLMs) that can "see" an image be said to "understand" it? Drawing inspiration from Searle's Chinese Room, we propose the Visual Room argument: a system may process and describe every detail of visual inputs by following algorithmic rules, without genuinely comprehending the underlying intention. This dilemma challenges the prevailing assumption that perceptual mastery implies genuine understanding. In implementation, we introduce a two-tier evaluation framework spanning perception and cognition. The perception component evaluates whether MLMs can accurately capture the surface-level details of visual content, while the cognition component examines their ability to infer sarcasm polarity. To support this framework, we further introduce a high-quality multi-modal sarcasm dataset comprising 924 static images and 100 dynamic videos. All sarcasm labels are annotated by the original authors and verified by independent reviewers to ensure clarity and consistency. We evaluate eight state-of-the-art (SoTA) MLMs. Our results highlight three key findings: (1) MLMs demonstrate high accuracy in visual perception; (2) even with correct perception, MLMs exhibit an average error rate of ~17.1% in sarcasm understanding, revealing a significant gap between seeing and understanding; (3) this gap stems from weaknesses in context integration, emotional reasoning, and pragmatic inference. This work provides empirical grounding for the proposed Visual Room argument and offers a new evaluation paradigm for MLMs.
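Finding (2) reads most naturally as a conditional error rate: among samples whose visual content the model perceived correctly, the fraction whose sarcasm polarity it still got wrong. The sketch below illustrates that reading on toy data. The `Sample` record, its field names, and both helper functions are hypothetical, and the abstract does not specify the paper's exact scoring protocol, so this is an assumption rather than the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    # Hypothetical record for one image or video; field names are illustrative.
    perception_correct: bool  # tier 1: did the model capture surface-level details?
    sarcasm_gold: str         # annotated sarcasm polarity
    sarcasm_pred: str         # tier 2: polarity inferred by the model


def perception_accuracy(samples: List[Sample]) -> Optional[float]:
    """Fraction of samples whose visual content was perceived correctly."""
    if not samples:
        return None
    return sum(s.perception_correct for s in samples) / len(samples)


def sarcasm_error_given_perception(samples: List[Sample]) -> Optional[float]:
    """Sarcasm error rate restricted to correctly perceived samples."""
    perceived = [s for s in samples if s.perception_correct]
    if not perceived:
        return None  # undefined when nothing was perceived correctly
    errors = sum(s.sarcasm_pred != s.sarcasm_gold for s in perceived)
    return errors / len(perceived)


# Toy data only; these are not results from the paper.
toy = [
    Sample(True, "sarcastic", "sarcastic"),
    Sample(True, "sarcastic", "non-sarcastic"),
    Sample(True, "non-sarcastic", "non-sarcastic"),
    Sample(False, "sarcastic", "non-sarcastic"),
]
acc = perception_accuracy(toy)
gap = sarcasm_error_given_perception(toy)
```

Conditioning on correct perception keeps perception failures out of the cognition score, so the resulting number isolates the "seeing versus understanding" gap that the abstract describes.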
