CV, AI · May 11

The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space

Xia Hu, Zhenrui Yue, Brian Potetz, Howard Zhou, Leonidas Guibas, Chun-Ta Lu, Zhicheng Wang

arXiv:2605.09883

AI Analysis

For researchers evaluating MLLMs, this work exposes a systematic vulnerability in current benchmarks and proposes a more robust evaluation method.

The paper identifies a 'Cartesian Shortcut' in which MLLMs exploit the orthogonal grid-based layouts of visual reasoning benchmarks. To address this, the authors introduce Polaris-Bench, which reformulates 53 tasks in Polar coordinates, and show that frontier models drop from 70-83% accuracy on Cartesian layouts to 31-39% on their Polar equivalents, revealing a lack of topology-invariant reasoning.

As current Multimodal Large Language Models rapidly saturate canonical visual reasoning benchmarks, a key question emerges: do these strong scores genuinely reflect robust visual understanding? We identify a pervasive vulnerability, the Cartesian Shortcut: visual reasoning benchmarks prevalently build on orthogonal grid-based layouts that can be readily discretized into explicit textual coordinates. Models systematically exploit this property, heavily leveraging text-based deductive reasoning to assist visual problem-solving. To systematically dismantle this shortcut, we introduce Polaris-Bench, which re-formulates 53 visual reasoning tasks in Polar coordinate space with paired Cartesian counterparts as reference, while preserving consistent logical constraints and task semantics, thus fundamentally breaking the orthogonal prior that models exploit. Comprehensive evaluation across 14 state-of-the-art MLLMs reveals that frontier models achieving 70-83% on Cartesian layouts collapse to 31-39% on Polar equivalents, with degradation persisting even under complete logical equivalence. Moreover, reasoning gains observed on Cartesian layouts are severely diminished on Polar equivalents. These findings expose a critical deficiency in current MLLMs: the lack of topology-invariant visual reasoning.
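
To make the idea of a paired Cartesian/Polar layout concrete, here is a minimal sketch of one way a grid cell could be re-expressed as a ring-and-sector cell. This is an illustrative assumption, not the construction used in Polaris-Bench: the function names `grid_to_polar` and `polar_to_xy`, the row-to-ring and column-to-sector mapping, and the `r_inner` and `ring_width` parameters are all hypothetical.

```python
import math


def grid_to_polar(row, col, n_cols, r_inner=1.0, ring_width=1.0):
    """Map grid cell (row, col) to the centre of a polar cell.

    Assumed mapping (hypothetical): each grid row becomes a concentric
    ring and each grid column becomes an angular sector.
    """
    # Radius of the middle of the ring that corresponds to this row.
    r = r_inner + (row + 0.5) * ring_width
    # Angle of the middle of the sector that corresponds to this column.
    theta = 2.0 * math.pi * (col + 0.5) / n_cols
    return r, theta


def polar_to_xy(r, theta):
    """Convert polar coordinates to image-plane x, y for rendering."""
    return r * math.cos(theta), r * math.sin(theta)


if __name__ == "__main__":
    # Render positions for a 3 x 8 grid laid out as 3 rings of 8 sectors.
    for row in range(3):
        for col in range(8):
            r, theta = grid_to_polar(row, col, n_cols=8)
            x, y = polar_to_xy(r, theta)
            print(f"cell ({row}, {col}) -> r={r:.2f}, theta={theta:.2f}, "
                  f"x={x:.2f}, y={y:.2f}")
```

Under a mapping like this, the logical index of each cell is unchanged, but its rendered position no longer falls on an orthogonal grid, so cell positions cannot be read off as simple row and column offsets in the image.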
