CV · AI · Mar 17

360° Image Perception with MLLMs: A Comprehensive Benchmark and a Training-Free Method

Huyen T. T. Tran, Van-Quang Nguyen, Farros Alferro, Kang-Jun Liu, Takayuki Okatani

arXiv:2603.16179 · 79.7 · h-index: 6

Predicted impact: top 29% in CV · last 90 days

Originality: Incremental advance

AI Analysis

This work addresses a domain-specific problem for researchers and practitioners in computer vision and AI: improving how MLLMs understand 360° images. It is rated incremental because it builds on existing MLLM capabilities, although it contributes a new method for a known bottleneck.

Multimodal Large Language Models (MLLMs) struggle with 360° image perception because of geometric distortion and complex spatial relations. The paper introduces 360Bench, a VQA benchmark with 7K-resolution images and seven tasks, and uses it to reveal shortcomings in existing models. It also proposes Free360, a training-free framework that consistently improves its base MLLM on 360° VQA tasks.

Multimodal Large Language Models (MLLMs) have shown impressive abilities in understanding and reasoning over conventional images. However, their perception of 360° images remains largely underexplored. Unlike conventional images, 360° images capture the entire surrounding environment, enabling holistic spatial reasoning but introducing challenges such as geometric distortion and complex spatial relations. To comprehensively assess MLLMs' capabilities to perceive 360° images, we introduce 360Bench, a Visual Question Answering (VQA) benchmark featuring 7K-resolution 360° images and seven representative (sub)tasks with annotations carefully curated by human annotators. Using 360Bench, we systematically evaluate seven MLLMs and six enhancement methods, revealing their shortcomings in 360° image perception. To address these challenges, we propose Free360, a training-free scene-graph-based framework for high-resolution 360° VQA. Free360 decomposes the reasoning process into modular steps, applies adaptive spherical image transformations to 360° images tailored to each step, and seamlessly integrates the resulting information into a unified graph representation for answer generation. Experiments show that Free360 consistently improves its base MLLM and provides a strong training-free solution for 360° VQA tasks. The source code and dataset will be publicly released upon acceptance.
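The abstract does not say which spherical image transformations Free360 applies, so the following is only an illustration of one common transformation for 360° images: sampling an undistorted perspective view from an equirectangular panorama. It is a minimal sketch, not the authors' method. The function name `perspective_to_equirect`, its parameters, and the axis conventions (x right, y down, z forward) are assumptions made for this example.

```python
import math


def perspective_to_equirect(u, v, out_w, out_h, fov_deg,
                            yaw_deg, pitch_deg, pano_w, pano_h):
    """Map pixel (u, v) of a virtual pinhole view to (x, y) in an
    equirectangular panorama. Hypothetical helper for illustration only.

    Conventions (assumed): camera x points right, y down, z forward;
    positive yaw turns right, positive pitch tilts up.
    """
    # Focal length in pixels from the horizontal field of view.
    f = (out_w / 2.0) / math.tan(math.radians(fov_deg) / 2.0)

    # Ray through the pixel in camera coordinates.
    x = u - (out_w - 1) / 2.0
    y = v - (out_h - 1) / 2.0
    z = f

    # Tilt up by pitch (rotation about the x axis).
    p = math.radians(pitch_deg)
    y, z = y * math.cos(p) - z * math.sin(p), y * math.sin(p) + z * math.cos(p)

    # Turn right by yaw (rotation about the vertical axis).
    t = math.radians(yaw_deg)
    x, z = x * math.cos(t) + z * math.sin(t), -x * math.sin(t) + z * math.cos(t)

    # Ray direction to longitude and latitude on the sphere.
    lon = math.atan2(x, z)                   # range (-pi, pi]
    lat = math.atan2(-y, math.hypot(x, z))   # range [-pi/2, pi/2], up is positive

    # Sphere to panorama pixel coordinates; longitude wraps horizontally.
    px = ((lon / (2.0 * math.pi) + 0.5) * pano_w) % pano_w
    py = (0.5 - lat / math.pi) * pano_h
    return px, py
```

Calling this for every pixel of an output view and sampling the panorama at the returned coordinates yields a perspective crop in which straight lines stay straight, which is the kind of input conventional MLLMs are typically trained on.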
