Towards Omnidirectional Reasoning with 360-R1: A Dataset, Benchmark, and GRPO-based Method
The work addresses a gap in multi-modal model support for immersive applications such as augmented reality and embodied AI, but the contribution is incremental: it builds on existing MLLM methods with adaptations specific to panoramic scenes.
The paper tackles the limited ability of multi-modal large language models (MLLMs) to understand and reason about omnidirectional images. It introduces the OmniVQA dataset and benchmark and proposes 360-R1, a GRPO-based method that improves performance on this task by 6%.
Omnidirectional images (ODIs), with their 360° field of view, provide unparalleled spatial awareness for immersive applications such as augmented reality and embodied AI. However, the ability of existing multi-modal large language models (MLLMs) to comprehend and reason about such panoramic scenes remains underexplored. This paper addresses this gap by introducing OmniVQA, the first dataset for omnidirectional visual question answering, and by conducting the first benchmark on this task. Our evaluation of state-of-the-art MLLMs reveals significant limitations in omnidirectional visual question answering, with persistent challenges in object localization, feature extraction, and hallucination suppression within panoramic contexts. These results underscore the gap between current MLLM capabilities and the demands of omnidirectional visual understanding, and they call for dedicated architectural or training innovations tailored to 360° imagery. Building on the OmniVQA dataset and benchmark, we further introduce 360-R1, a rule-based reinforcement learning method built on Qwen2.5-VL-Instruct. Concretely, we modify group relative policy optimization (GRPO) by proposing three novel reward functions: (1) a reasoning process similarity reward, (2) an answer semantic accuracy reward, and (3) a structured format compliance reward. Extensive experiments on OmniVQA demonstrate the superiority of the proposed method in omnidirectional space (+6% improvement).
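To make the reward design more concrete, the following is a minimal sketch of how three rule-based rewards could be combined and turned into group-relative advantages, the normalization step that gives GRPO its name. It is not the authors' implementation. The abstract does not define the rewards, so several pieces here are assumptions: the `<think>`/`<answer>` output template, the token-overlap `jaccard` function standing in for a semantic similarity measure, the reward weights, and all function names.

```python
import re
from statistics import mean, pstdev

# Hypothetical output template; the abstract does not specify the tag names.
FORMAT_RE = re.compile(
    r"^<think>(.*?)</think>\s*<answer>(.*?)</answer>$", re.DOTALL
)


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1].

    Assumption: a cheap stand-in for whatever semantic similarity
    measure the actual method uses.
    """
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def total_reward(completion, ref_reasoning, ref_answer, weights=(1.0, 1.0, 0.5)):
    """Weighted sum of the three rewards (weights are assumed, not from the paper)."""
    match = FORMAT_RE.match(completion.strip())
    if match is None:
        # No parsable reasoning or answer, so all three rewards are zero.
        return 0.0
    reasoning, answer = match.group(1), match.group(2)
    r_reason = jaccard(reasoning, ref_reasoning)  # (1) reasoning process similarity
    r_answer = jaccard(answer, ref_answer)        # (2) answer semantic accuracy
    r_format = 1.0                                # (3) structured format compliance
    w_reason, w_answer, w_format = weights
    return w_reason * r_reason + w_answer * r_answer + w_format * r_format


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: A_i = (r_i - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group of completions sampled for the same (image, question) pair.
group = [
    "<think>the sofa is left of the door</think><answer>left</answer>",
    "<think>the sofa is behind the camera</think><answer>behind</answer>",
    "left",  # violates the assumed template
]
rewards = [total_reward(c, "the sofa is left of the door", "left") for c in group]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the other completions in the same group, no learned value function is needed; a completion is reinforced only to the extent that it scores above its siblings.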