Towards Comprehensive Scene Understanding: Integrating First and Third-Person Views for LVLMs
This addresses the challenge of spatially or contextually demanding queries in interactive applications like virtual and augmented reality, though it is incremental as it builds on existing LVLM and prompting techniques.
The paper tackles the problem of limited scene understanding in large vision-language models (LVLMs) when using only first-person views by integrating third-person views to provide global context, resulting in performance gains of 4.84% for GPT-4o and 5.94% for Gemini 2.0 Flash over a baseline.
Large vision-language models (LVLMs) are increasingly deployed in interactive applications such as virtual and augmented reality, where a first-person (egocentric) view captured by head-mounted cameras serves as key input. While this view offers fine-grained cues about user attention and hand-object interactions, its narrow field of view and lack of global context often lead to failures on spatially or contextually demanding queries. To address this, we introduce a framework that augments egocentric inputs with third-person (exocentric) views, providing complementary information such as global scene layout and object visibility to LVLMs. We present E3VQA, the first benchmark for multi-view question answering with 4K high-quality question-answer pairs grounded in synchronized ego-exo image pairs. Additionally, we propose M3CoT, a training-free prompting technique that constructs a unified scene representation by integrating scene graphs from three complementary perspectives. M3CoT enables LVLMs to reason more effectively across views, yielding consistent performance gains (4.84% for GPT-4o and 5.94% for Gemini 2.0 Flash) over a recent CoT baseline. Our extensive evaluation reveals key strengths and limitations of LVLMs in multi-view reasoning and highlights the value of leveraging both egocentric and exocentric inputs. The dataset and source code are available at https://github.com/Leeinsu1/Towards-Comprehensive-Scene-Understanding.