CV | Feb 2

Seeing Is Believing? A Benchmark for Multimodal Large Language Models on Visual Illusions and Anomalies

Wenjin Hou, Wei Liu, Han Hu, Xiaoxiao Sun, Serena Yeung-Levy, Hehe Fan

arXiv:2602.01816v1 | 6.05 citations | h-index: 19 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses a gap in robustness testing for MLLMs, which matters to researchers and developers working toward artificial general intelligence. Its contribution is incremental in that it provides a benchmark rather than a new method.

The authors evaluate Multimodal Large Language Models (MLLMs) on visual illusions and anomalies. They introduce VIA-Bench, with over 1K question-answer pairs, and find significant vulnerabilities across more than 20 state-of-the-art models. Chain-of-Thought reasoning offers negligible robustness.

Multimodal Large Language Models (MLLMs) have shown remarkable proficiency on general-purpose vision-language benchmarks, reaching or even exceeding human-level performance. However, these evaluations typically rely on standard in-distribution data, leaving the robustness of MLLMs largely unexamined when faced with scenarios that defy common-sense priors. To address this gap, we introduce VIA-Bench, a challenging benchmark designed to probe model performance on visual illusions and anomalies. It includes six core categories: color illusions, motion illusions, gestalt illusions, geometric and spatial illusions, general visual illusions, and visual anomalies. Through careful human-in-the-loop review, we construct over 1K high-quality question-answer pairs that require nuanced visual reasoning. Extensive evaluation of over 20 state-of-the-art MLLMs, including proprietary, open-source, and reasoning-enhanced models, uncovers significant vulnerabilities. Notably, we find that Chain-of-Thought (CoT) reasoning offers negligible robustness, often yielding "brittle mirages" where the model's logic collapses under illusory stimuli. Our findings reveal a fundamental divergence between machine and human perception, suggesting that resolving such perceptual bottlenecks is critical for the advancement of artificial general intelligence. The benchmark data and code will be released.
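A benchmark organized into categories, as described above, is commonly scored with per-category accuracy. The following is a minimal sketch of that kind of scoring, not the authors' evaluation code: the record fields (`category`, `answer`, `prediction`), the exact-match rule, and the toy data are all assumptions made for illustration, since the abstract does not specify the data format or the metric.

```python
from collections import defaultdict


def per_category_accuracy(records):
    """Return {category: accuracy} using case-insensitive exact match.

    Each record is a dict with hypothetical keys 'category', 'answer',
    and 'prediction'. This layout is assumed, not taken from VIA-Bench.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        category = record["category"]
        total[category] += 1
        if record["prediction"].strip().lower() == record["answer"].strip().lower():
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


# Toy records, invented for illustration only (not benchmark data).
toy_records = [
    {"category": "color", "answer": "A", "prediction": "a"},
    {"category": "color", "answer": "B", "prediction": "C"},
    {"category": "anomaly", "answer": "yes", "prediction": " Yes "},
]

scores = per_category_accuracy(toy_records)
print(scores)
```

Reporting accuracy per category, instead of a single pooled number, makes it easier to see whether a model fails on one kind of illusion in particular.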
