CV · May 23

EgoProx: Evaluating MLLMs on Egocentric 3D Proximity Reasoning Across a Cognitive Hierarchy

arXiv:2605.24456 · 53.3
Predicted impact: top 6% in CV · last 90 days
Originality: Incremental advance
AI Analysis

This benchmark addresses the need for evaluating embodied 3D reasoning in MLLMs, a critical capability for real-world applications.

The paper introduces EgoProx, a benchmark for evaluating multimodal large language models on egocentric 3D proximity reasoning across a cognitive hierarchy. Results show that while MLLMs exhibit some spatial knowledge, they struggle to effectively leverage it for spatial reasoning VQA.

Humans constantly reason about 3D proximity, the relations between their body and surrounding objects, to guide perception and action in daily life. Whether multimodal large language models (MLLMs) can perform such embodied 3D reasoning remains unclear. To this end, we introduce EgoProx, a benchmark for egocentric 3D proximity reasoning. We organize our tasks along a cognitive chain, covering intention, exploration, exploitation, and chain-of-actions reasoning. We also design an agent-based data engine that produces diverse and consistent QA pairs at scale. We benchmark prevailing MLLMs on EgoProx and conduct additional analyses with dataset-specific and task-specific instruction tuning. We observe large cross-domain gains, indicating that current MLLMs contain some spatial knowledge; however, they still struggle to leverage it effectively for spatial reasoning VQA.
