CV · Mar 10

More than the Sum: Panorama-Language Models for Adverse Omni-Scenes

arXiv:2603.09573v188.22 citations · h-index: 10 · Has Code
Predicted impact: top 14% in CV · last 90 days
Originality: Incremental advance
AI Analysis

This work addresses the need for better panoramic scene understanding in applications such as autonomous driving, though the advance is incremental because it builds on existing pinhole-based models.

The paper tackles the problem of vision-language models being limited to pinhole imagery by introducing Panorama-Language Modeling (PLM) for unified 360° reasoning, achieving superior robustness and holistic understanding in adverse omni-scenes.

Existing vision-language models (VLMs) are tailored for pinhole imagery, stitching multiple narrow field-of-view inputs to piece together a complete omni-scene understanding. Yet, such multi-view perception overlooks the holistic spatial and contextual relationships that a single panorama inherently preserves. In this work, we introduce the Panorama-Language Modeling (PLM) paradigm, a unified $360^\circ$ vision-language reasoning approach that is more than the sum of its pinhole counterparts. In addition, we present PanoVQA, a large-scale panoramic VQA dataset that involves adverse omni-scenes, enabling comprehensive reasoning under object occlusions and driving accidents. To establish a foundation for PLM, we develop a plug-and-play panoramic sparse attention module that allows existing pinhole-based VLMs to process equirectangular panoramas without retraining. Extensive experiments demonstrate that our PLM achieves superior robustness and holistic reasoning under challenging omni-scenes, yielding understanding greater than the sum of its narrow parts. Project page: https://github.com/InSAI-Lab/PanoVQA.

