CL · CV · Jun 3, 2025

SemVink: Advancing VLMs' Semantic Understanding of Optical Illusions via Visual Global Thinking

arXiv:2506.02803v3 · 2 citations · h-index: 19 · EMNLP
Originality: Incremental advance
AI Analysis

The finding exposes a critical architectural flaw in VLMs that limits their real-world robustness in applications such as medical imaging and security, though the proposed solution is incremental.

The paper addresses the failure of vision-language models (VLMs) to detect hidden content in optical illusions and AI-generated images. On a new benchmark, HC-Bench, leading VLMs reach near-zero accuracy (0-5.36%). Simply scaling images down to low resolutions (32-128 pixels) raises accuracy above 99% by eliminating visual noise.

Vision-language models (VLMs) excel in semantic tasks but falter at a core human capability: detecting hidden content in optical illusions or AI-generated images through perceptual adjustments like zooming. We introduce HC-Bench, a benchmark of 112 images with hidden text, objects, and illusions, revealing that leading VLMs achieve near-zero accuracy (0-5.36%), even with explicit prompting. Humans resolve such ambiguities instinctively, yet VLMs fail due to an overreliance on high-level semantics. Strikingly, we propose SemVink (Semantic Visual Thinking), which simply scales images to low resolutions (32-128 pixels) and unlocks >99% accuracy by eliminating redundant visual noise. This exposes a critical architectural flaw: VLMs prioritize abstract reasoning over low-level visual operations crucial for real-world robustness. Our work urges a shift toward hybrid models integrating multi-scale processing, bridging the gap between computational vision and human cognition for applications in medical imaging, security, and beyond.
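
The downscaling idea in the abstract can be illustrated with a toy sketch. This is not the authors' pipeline: the abstract does not say which resampling filter is used or whether "32-128 pixels" refers to the longer side, so both are assumptions here. The names `downscale_to_max_side` and `make_demo_image` are hypothetical, and the synthetic image merely stands in for an HC-Bench example. The sketch uses a plain box filter (block averaging) in pure Python to show why shrinking can help: fine texture averages out while the coarse, hidden pattern survives. In practice an image library's resize function would do the same job.

```python
def downscale_to_max_side(pixels, max_side):
    """Box-filter downscale of a grayscale image given as a list of rows.

    Assumption: max_side is the target length of the longer side
    (the abstract reports 32-128 pixels working well).
    """
    height, width = len(pixels), len(pixels[0])
    scale = max_side / max(height, width)
    if scale >= 1:
        # Already small enough: return an unchanged copy.
        return [row[:] for row in pixels]
    new_h = max(1, round(height * scale))
    new_w = max(1, round(width * scale))
    out = []
    for y in range(new_h):
        y0 = y * height // new_h
        y1 = max(y0 + 1, (y + 1) * height // new_h)
        row = []
        for x in range(new_w):
            x0 = x * width // new_w
            x1 = max(x0 + 1, (x + 1) * width // new_w)
            # Average every source pixel that falls into this output cell.
            block = [pixels[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def make_demo_image(size=256):
    """Synthetic stand-in for a hidden-content image (illustrative only).

    The hidden global pattern is a brightness step (left 100, right 140),
    buried under a pixel-level checkerboard texture of amplitude 60.
    """
    image = []
    for y in range(size):
        row = []
        for x in range(size):
            base = 100 if x < size // 2 else 140
            texture = 60 if (x + y) % 2 == 0 else -60
            row.append(base + texture)
        image.append(row)
    return image


if __name__ == "__main__":
    full = make_demo_image(256)
    small = downscale_to_max_side(full, 32)
    # Neighbouring full-resolution pixels differ by 120, hiding the 40-level step.
    print(abs(full[0][0] - full[0][1]))
    # After averaging 8x8 blocks the texture cancels and only the step remains.
    print(small[0][0], small[0][31])
```

In the demo, local contrast at full resolution is three times larger than the hidden step, while the 32-pixel version contains only the step. The downscaled image would then be passed to the VLM in place of the original.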

