CV · Feb 25

PanoEnv: Exploring 3D Spatial Intelligence in Panoramic Environments with Reinforcement Learning

arXiv:2602.21992v1 · 1 citation · h-index: 3

Originality: Incremental advance

AI Analysis

This work addresses the need for better 3D spatial intelligence in VLMs for applications such as virtual reality and robotics. It is an incremental advance, building on existing methods with specific enhancements.

The paper tackles the problem of Vision-Language Models (VLMs) struggling with 3D spatial reasoning on panoramic images, which stems from geometric distortion and limited 3D supervision. It introduces a new benchmark and a reinforcement learning post-training framework that raise overall accuracy from 49.34% to 52.93% and open-ended accuracy to 14.83%.

360° panoramic images are increasingly used in virtual reality, autonomous driving, and robotics for holistic scene understanding. However, current Vision-Language Models (VLMs) struggle with 3D spatial reasoning on Equirectangular Projection (ERP) images due to geometric distortion and limited 3D supervision. We introduce PanoEnv, a large-scale VQA benchmark built from synthetic 3D environments, containing 14.8K questions across five categories (e.g., relative position, volume comparison) grounded in accurate 3D annotations including depth, segmentation, and bounding boxes. Benchmarking 14 state-of-the-art VLMs reveals limited 3D understanding, achieving only 49.34% overall accuracy and 8.36% on open-ended (OE) questions. To enhance 3D reasoning, we propose a reinforcement learning post-training framework based on Group Relative Policy Optimization (GRPO) with a ground-truth-guided reward that incorporates five geometry-aware strategies such as distance tolerance and spatial consistency. A two-stage curriculum further mitigates catastrophic forgetting: Stage 1 trains on structured tasks (true/false and multiple choice), and Stage 2 fine-tunes on mixed open-ended data to improve generalization. Our 7B model achieves new state-of-the-art performance, improving overall accuracy to 52.93% (+3.59 percentage points) and open-ended accuracy to 14.83% while maintaining structured-task performance. It also achieves top semantic evaluation scores (Q-Score 6.24, P-Score 5.95), surpassing 32B models. These results demonstrate that PanoEnv-QA and our curriculum-based RL framework effectively instill 3D spatial intelligence in VLMs for omnidirectional perception.
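To make the reward idea in the abstract concrete, here is a minimal sketch of how a distance-tolerance reward could feed the group-relative advantage used in GRPO. This is not the authors' implementation: the function names, the 10% relative tolerance, the binary reward, and the use of the population standard deviation are all assumptions chosen for illustration. The abstract only states that the reward is ground-truth-guided and includes a distance-tolerance strategy.

```python
from statistics import mean, pstdev


def distance_reward(predicted: float, truth: float, tolerance: float = 0.10) -> float:
    """Hypothetical reward: 1.0 if the prediction lies within a relative
    tolerance of the ground-truth distance, otherwise 0.0."""
    if truth == 0:
        return 1.0 if predicted == 0 else 0.0
    return 1.0 if abs(predicted - truth) / abs(truth) <= tolerance else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and standard deviation of its
    own sampling group, which is the core idea behind GRPO advantages."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one distance question with an assumed ground truth of 2.0 m.
samples = [1.9, 2.6, 2.05, 3.0]
rewards = [distance_reward(s, truth=2.0) for s in samples]
advantages = group_relative_advantages(rewards)
```

Answers within the tolerance receive a positive advantage and the rest a negative one, so the policy update favors the geometrically closer responses without needing a separate value model.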
