CV · Feb 22

Direction-aware 3D Large Multimodal Models

Quan Liu, Weihao Xuan, Junjue Wang, Naoto Yokoya, Ling Shao, Shijian Lu

arXiv:2602.19063v1 · 1.5 · h-index: 26

Originality: Incremental advance

AI Analysis

This work addresses a critical bottleneck for researchers and practitioners in 3D vision and robotics by providing a rigorous paradigm for direction-aware spatial reasoning, though it is incremental as it builds on existing 3D LMM backbones.

The paper tackles the problem of enabling direction-aware 3D large multimodal models by identifying and supplementing missing ego poses in point cloud benchmarks, resulting in improvements such as a 30.0% increase in ScanRefer mIoU and an 11.7% boost in Scan2Cap LLM-as-judge accuracy.

3D large multimodal models (3D LMMs) rely heavily on ego poses to enable directional question-answering and spatial reasoning. However, most existing point cloud benchmarks contain rich directional queries but lack the corresponding ego poses, which makes them inherently ill-posed for 3D large multimodal modelling. In this work, we define a rigorous new paradigm that enables direction-aware 3D LMMs by identifying ego poses, supplementing point cloud benchmarks with them, and transforming the corresponding point cloud data according to the identified ego poses. We enable direction-aware 3D LMMs with two novel designs. The first is PoseRecover, a fully automatic pose recovery pipeline that matches questions with ego poses from RGB-D video extrinsics via object-frustum intersection and a Z-buffer visibility check. The second is PoseAlign, which transforms the point cloud data to align with the identified ego poses, instead of either injecting ego poses into textual prompts or introducing pose-encoded features in the projection layers. Extensive experiments show that our designs yield consistent improvements across multiple 3D LMM backbones such as LL3DA, LL3DA-SONATA, Chat-Scene, and 3D-LLAVA, improving ScanRefer mIoU by 30.0% and Scan2Cap LLM-as-judge accuracy by 11.7%. In addition, our approach is simple, generic, and training-efficient: it requires only instruction tuning while establishing a strong baseline for direction-aware 3D LMMs.
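The two geometric ideas named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a camera-to-world pose given as a 3x3 rotation plus a translation, a pinhole camera looking along +z, and a depth map in metres; all function names and parameters (`world_to_ego`, `in_frustum`, `is_visible`, `tol`, `near`, `far`) are hypothetical. `world_to_ego` shows the kind of rigid transform that aligning a point cloud to an ego pose involves, and the other two show a frustum test followed by a Z-buffer style visibility test for a single point.

```python
def world_to_ego(points, rotation, translation):
    """Express world-frame points in the ego (camera) frame.

    Assumption: `rotation` (3x3 nested lists) and `translation` (length 3)
    form a camera-to-world pose, so p_ego = R^T (p_world - t).
    """
    ego_points = []
    for p in points:
        d = [p[i] - translation[i] for i in range(3)]
        # Multiplying by R^T: component j is the dot product of d with column j of R.
        ego_points.append(
            tuple(sum(rotation[i][j] * d[i] for i in range(3)) for j in range(3))
        )
    return ego_points


def project(p_ego, fx, fy, cx, cy):
    """Pinhole projection of an ego-frame point with z > 0 to pixel coordinates."""
    x, y, z = p_ego
    return fx * x / z + cx, fy * y / z + cy


def in_frustum(p_ego, fx, fy, cx, cy, width, height, near=0.1, far=10.0):
    """True if the point lies between the near/far planes and inside the image."""
    if not (near <= p_ego[2] <= far):
        return False
    u, v = project(p_ego, fx, fy, cx, cy)
    return 0 <= u < width and 0 <= v < height


def is_visible(p_ego, depth, fx, fy, cx, cy, tol=0.05):
    """Z-buffer style test: the point is visible if it is not behind the
    depth recorded at its pixel (depth[v][u], in metres), within `tol`."""
    height, width = len(depth), len(depth[0])
    if not in_frustum(p_ego, fx, fy, cx, cy, width, height):
        return False
    u, v = project(p_ego, fx, fy, cx, cy)
    return p_ego[2] <= depth[int(v)][int(u)] + tol


# Usage: a camera at (1, 2, 0), rotated 90 degrees about the world z-axis.
R = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
t = (1, 2, 0)
ego = world_to_ego([(1, 3, 0)], R, t)

# Usage: a point 2 m in front of the camera against a flat wall 3 m away.
wall = [[3.0] * 100 for _ in range(100)]
seen = is_visible((0.0, 0.0, 2.0), wall, fx=100, fy=100, cx=50, cy=50)
```

A full pipeline would test whole object boxes against many video frames rather than single points, and would pick one frame per question; those selection rules are not specified in the abstract and are left out here.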
