CV | MM | Jan 6, 2024

3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding

arXiv:2401.03201v2 | 49 citations | h-index: 8 | Has Code | 2024 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)
AI Analysis

This work addresses 3D scene understanding for AI systems. The contribution is incremental: it builds on existing MLLM capabilities by adapting them to 3D data.

The paper tackles the challenge of enabling multi-modal large language models (MLLMs) to understand 3D scenes. It collects a dataset of 75K instruction-response pairs and introduces 3DMIT, a prompt tuning paradigm that eliminates the alignment stage and extends the prompt with 3D information. The authors report enhanced performance on tasks such as 3D VQA, grounding, and conversation.

The remarkable potential of multi-modal large language models (MLLMs) in comprehending both vision and language information has been widely acknowledged. However, the scarcity of 3D scene-language pairs in comparison to their 2D counterparts, coupled with the inadequacy of existing approaches to LLM understanding of 3D scenes, poses a significant challenge. In response, we collect and construct an extensive dataset comprising 75K instruction-response pairs tailored for 3D scenes. This dataset addresses tasks related to 3D VQA, 3D grounding, and 3D conversation. To further enhance the integration of 3D spatial information into LLMs, we introduce a novel and efficient prompt tuning paradigm, 3DMIT. This paradigm eliminates the alignment stage between 3D scenes and language and extends the instruction prompt with 3D modality information, including the entire scene and segmented objects. We evaluate the effectiveness of our method across diverse tasks in the 3D scene domain and find that our approach serves as a strategic means to enrich LLMs' comprehension of the 3D world. Our code is available at https://github.com/staymylove/3DMIT.
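The idea of extending an instruction prompt with scene and object information can be illustrated with a minimal sketch. This is not the format used in the 3DMIT repository: the placeholder strings `<scene>` and `<obj_i>`, the `InstructionSample` fields, and the `build_prompt` helper are all hypothetical names chosen for illustration. The sketch only shows how one instruction-response pair might be wrapped so that a scene slot and one slot per segmented object precede the instruction.

```python
from dataclasses import dataclass

# Hypothetical placeholder strings; the real token names are not given in the text.
SCENE_SLOT = "<scene>"
OBJECT_SLOT = "<obj_{index}>"


@dataclass
class InstructionSample:
    """One instruction-response pair for a 3D scene (field names are assumptions)."""
    task: str          # e.g. "vqa", "grounding", or "conversation"
    instruction: str
    response: str
    num_objects: int   # number of segmented objects in the scene


def build_prompt(sample: InstructionSample) -> str:
    """Extend the instruction with one scene slot and one slot per segmented object."""
    object_slots = " ".join(
        OBJECT_SLOT.format(index=i) for i in range(sample.num_objects)
    )
    return (
        f"Scene: {SCENE_SLOT}\n"
        f"Objects: {object_slots}\n"
        f"Instruction: {sample.instruction}\n"
        "Response:"
    )


sample = InstructionSample(
    task="vqa",
    instruction="What color is the chair next to the desk?",
    response="Brown.",
    num_objects=3,
)
print(build_prompt(sample))
```

In a real pipeline, each slot would presumably be replaced by features computed from the whole scene or from an individual segmented object before the prompt reaches the LLM; that step is outside the scope of this sketch.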

Code Implementations: 1 repo
