CV · Sep 2, 2025

Omnidirectional Spatial Modeling from Correlated Panoramas

arXiv:2509.02164v1 · 5 citations · h-index: 1 · MMAsia
Originality: Highly original
AI Analysis

The paper addresses holistic 360° scene understanding for applications such as embodied AI and autonomous driving. It contributes a new benchmark and a new method rather than an incremental improvement.

The paper tackles omnidirectional scene understanding by introducing CFpano, the first benchmark dataset for visual question answering over cross-frame correlated panoramas, together with a multi-modal large language model (MLLM) that achieves state-of-the-art performance, improving overall results by 5.37%.

Omnidirectional scene understanding is vital for various downstream applications, such as embodied AI, autonomous driving, and immersive environments, yet remains challenging due to geometric distortion and complex spatial relations in 360° imagery. Existing omnidirectional methods achieve scene understanding within a single frame while neglecting cross-frame correlated panoramas. To bridge this gap, we introduce CFpano, the first benchmark dataset dedicated to visual question answering (VQA) over cross-frame correlated panoramas in holistic 360° scenes. CFpano consists of over 2700 images and over 8000 question-answer pairs, with both multiple-choice and open-ended question types. Building upon CFpano, we further present a multi-modal large language model (MLLM) fine-tuned with Group Relative Policy Optimization (GRPO) and a set of tailored reward functions for robust and consistent reasoning over cross-frame correlated panoramas. We benchmark existing MLLMs on CFpano. The experimental results demonstrate that our model achieves state-of-the-art performance across both multiple-choice and open-ended VQA tasks, outperforming strong baselines on all major reasoning categories (+5.37% in overall performance). Our analyses validate the effectiveness of GRPO and establish a new benchmark for panoramic scene understanding.
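The abstract does not spell out the reward functions or the training loop, so the following is only a minimal sketch of the group-relative advantage step that gives GRPO its name: several answers are sampled for one question, each is scored, and each score is normalized against the mean and standard deviation of its own group. The `toy_reward` function, its 1.0 and 0.2 weights, and the A-D option set are hypothetical illustrations, not the paper's tailored rewards.

```python
from statistics import mean, pstdev

OPTIONS = {"A", "B", "C", "D"}  # hypothetical multiple-choice option set


def toy_reward(answer: str, reference: str) -> float:
    """Hypothetical reward: 1.0 for a correct choice, plus 0.2 for a well-formed answer."""
    normalized = answer.strip().upper()
    correctness = 1.0 if normalized == reference else 0.0
    format_bonus = 0.2 if normalized in OPTIONS else 0.0
    return correctness + format_bonus


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against its own sampling group (mean and std of the group)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample in the group scores the same
    return [(r - mu) / (sigma + eps) for r in rewards]


# One question, four sampled answers, reference answer "B"
samples = ["B", "b", "C", "I think B"]
rewards = [toy_reward(s, "B") for s in samples]
advantages = group_relative_advantages(rewards)
```

Answers that score above the group average receive positive advantages and are reinforced, while below-average answers receive negative ones. Because the baseline is the group mean, no separate value network is needed.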

