Dense360: Dense Understanding from Omnidirectional Panoramas
The paper addresses the need for comprehensive visual inputs in MLLMs for applications such as robotics or VR, but the contribution is incremental: it adapts existing methods to a new data format.
The paper tackles dense visual understanding from omnidirectional panoramas for Multimodal Large Language Models (MLLMs). It introduces a dataset of 160K panoramas with 5M dense entity-level captions, proposes ERP-RoPE, a position encoding scheme that addresses the challenges of equirectangular projection (ERP), and presents Dense360-Bench, the first benchmark for panoramic captioning and grounding.
Multimodal Large Language Models (MLLMs) require comprehensive visual inputs to achieve dense understanding of the physical world. While existing MLLMs demonstrate impressive world understanding capabilities through limited field-of-view (FOV) visual inputs (e.g., 70 degrees), we take the first step toward dense understanding from omnidirectional panoramas. We first introduce an omnidirectional panorama dataset featuring a comprehensive suite of reliability-scored annotations. Specifically, our dataset contains 160K panoramas with 5M dense entity-level captions, 1M unique referring expressions, and 100K entity-grounded panoramic scene descriptions. Compared to multi-view alternatives, panoramas can provide more complete, compact, and continuous scene representations through equirectangular projections (ERP). However, the use of ERP introduces two key challenges for MLLMs: i) spatial continuity along the circle of latitude, and ii) latitude-dependent variation in information density. We address these challenges through ERP-RoPE, a position encoding scheme specifically designed for panoramic ERP. In addition, we introduce Dense360-Bench, the first benchmark for evaluating MLLMs on omnidirectional captioning and grounding, establishing a comprehensive framework for advancing dense visual-language understanding in panoramic settings.
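The two ERP challenges named in the abstract can be illustrated with a short geometric sketch. This is not the ERP-RoPE formulation, which the abstract does not specify; it only shows the properties of an equirectangular grid that such a position encoding would need to account for. The function names and the 64 x 32 token grid are hypothetical, chosen purely for illustration. The sketch assumes the standard ERP convention that columns map linearly to longitude and rows map linearly to latitude.

```python
import math


def wrapped_column_distance(col_a: int, col_b: int, width: int) -> int:
    """Shortest horizontal distance between two columns on a wrap-around grid.

    In an ERP image the left and right edges meet on the sphere, so the
    first and last columns are neighbours (challenge i: spatial continuity).
    """
    direct = abs(col_a - col_b) % width
    return min(direct, width - direct)


def row_latitude(row: int, height: int) -> float:
    """Latitude in radians of the centre of a row; +pi/2 is the top pole."""
    return math.pi / 2 - (row + 0.5) / height * math.pi


def row_density_weight(row: int, height: int) -> float:
    """Length of the row's circle of latitude relative to the equator.

    Every ERP row has the same number of pixels, but rows near the poles
    cover a much shorter circle on the sphere, so each pixel there carries
    less scene content (challenge ii: latitude-dependent density).
    """
    return math.cos(row_latitude(row, height))


if __name__ == "__main__":
    width, height = 64, 32  # hypothetical token grid, not from the paper
    print("edge-to-edge distance:", wrapped_column_distance(0, width - 1, width))
    print("weight at top row:", round(row_density_weight(0, height), 4))
    print("weight near equator:", round(row_density_weight(height // 2, height), 4))
```

Under these assumptions, a naive 2D position index would place the first and last columns maximally far apart even though they are adjacent on the sphere, and would treat a polar row as carrying as much information as an equatorial one.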