Lost in Space? Vision-Language Models Struggle with Relative Camera Pose Estimation
This work highlights a critical limitation in grounding VLMs in 3D and multi-view spatial reasoning, a capability that matters for robotics and augmented reality. The contribution is incremental, however, since it builds on existing benchmarks and diagnostic methods.
The paper examines the weak 3D spatial understanding of Vision-Language Models (VLMs) by evaluating them on relative camera pose estimation. It finds that most VLMs fail to generalize beyond 2D heuristics: state-of-the-art models such as GPT-5 score 0.64, compared with 0.97 for geometric baselines and 0.92 for humans.
Vision-Language Models (VLMs) perform well in 2D perception and semantic reasoning but show limited understanding of 3D spatial structure. We investigate this gap using relative camera pose estimation (RCPE), a fundamental vision task that requires inferring relative camera translation and rotation from a pair of images. We introduce VRRPI-Bench, a benchmark derived from unlabeled egocentric videos with verbalized annotations of relative camera motion, reflecting realistic scenarios with simultaneous translation and rotation around a shared object. We further propose VRRPI-Diag, a diagnostic benchmark that isolates individual motion degrees of freedom. Despite the simplicity of RCPE, most VLMs fail to generalize beyond shallow 2D heuristics, particularly for depth changes and roll transformations along the optical axis. Even state-of-the-art models such as GPT-5 ($0.64$) fall short of classic geometric baselines ($0.97$) and human performance ($0.92$). Moreover, VLMs exhibit difficulty in multi-image reasoning, with inconsistent performance (best $59.7\%$) when integrating spatial cues across frames. Our findings reveal limitations in grounding VLMs in 3D and multi-view spatial reasoning.
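
To make the RCPE task concrete, the following is a minimal sketch of how a relative pose between two cameras could be computed from known extrinsics and reduced to a coarse verbal label. It is not the benchmark's annotation pipeline, which is not described here. The function names, the label vocabulary, and the OpenCV-style axis convention (x right, y down, z forward along the optical axis) are all assumptions made for illustration.

```python
import numpy as np

def relative_pose(R1, t1, R2, t2):
    """Relative pose taking camera-1 coordinates to camera-2 coordinates.

    Assumes world-to-camera extrinsics, i.e. x_cam = R @ x_world + t.
    """
    R_rel = R2 @ R1.T
    t_rel = t2 - R_rel @ t1
    return R_rel, t_rel

def verbalize(R_rel, t_rel, eps=1e-6):
    """Return (dominant translation label, dominant rotation axis, angle in degrees).

    Labels are hypothetical and assume x right, y down, z forward.
    """
    # Center of camera 2 expressed in camera 1's frame.
    c = -R_rel.T @ t_rel
    t_names = [("left", "right"), ("up", "down"), ("backward", "forward")]
    if np.linalg.norm(c) < eps:
        move = "none"
    else:
        i = int(np.argmax(np.abs(c)))
        move = t_names[i][int(c[i] > 0)]

    # Rotation angle from the trace of the rotation matrix.
    angle = np.arccos(np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0))
    if angle < eps:
        return move, "none", 0.0
    # Unnormalized rotation axis from the skew-symmetric part of R_rel.
    # (Degenerate near 180 degrees; ignored in this sketch.)
    w = np.array([
        R_rel[2, 1] - R_rel[1, 2],
        R_rel[0, 2] - R_rel[2, 0],
        R_rel[1, 0] - R_rel[0, 1],
    ])
    axis = ["pitch", "yaw", "roll"][int(np.argmax(np.abs(w)))]
    return move, axis, float(np.degrees(angle))

def rot_z(deg):
    """Rotation matrix about the z (optical) axis."""
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0],
                     [0.0,        0.0,       1.0]])

# Camera 1 sits at the world origin with identity orientation.
R1, t1 = np.eye(3), np.zeros(3)
# Camera 2 is one unit further along the optical axis and rolled by 30 degrees.
C2 = np.array([0.0, 0.0, 1.0])   # camera-2 center in world coordinates
R2 = rot_z(30.0).T               # world-to-camera rotation
t2 = -R2 @ C2

move, axis, degrees = verbalize(*relative_pose(R1, t1, R2, t2))
print(move, axis, round(degrees, 1))
```

The example pair combines exactly the two motions the abstract reports as hardest for VLMs: a depth change and a roll about the optical axis. With ground-truth poses the geometry is a few lines of linear algebra, which is part of why classic geometric baselines score so high on this task.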