AI CL CVSep 30, 2024

Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning

Weitai Kang, Haifeng Huang, Yuzhang Shang, Mubarak Shah, Yan Yan

arXiv:2410.00255v220.524 citationsh-index: 15Has Code

Originality Highly original

AI Analysis

This work addresses the problem of limited discriminative power and generalization in 3DLLMs for building general-purpose agents in the 3D real world, offering significant improvements for researchers and developers in this domain.

The paper introduces Robin3D, a 3D Large Language Model (3DLLM) trained on 1 million instruction-following data generated by a novel Robust Instruction Generation (RIG) engine. Robin3D achieves a 7.8% improvement in the grounding task (Multi3DRefer) and a 6.9% improvement in the captioning task (Scan2Cap) compared to previous methods.

Recent advancements in 3D Large Language Models (3DLLMs) have highlighted their potential in building general-purpose agents in the 3D real world, yet challenges remain due to the lack of high-quality robust instruction-following data, leading to limited discriminative power and generalization of 3DLLMs. In this paper, we introduce Robin3D, a powerful 3DLLM trained on large-scale instruction-following data generated by our novel data engine, Robust Instruction Generation (RIG) engine. RIG generates two key instruction data: 1) the Adversarial Instruction-following data, which features mixed negative and positive samples to enhance the model's discriminative understanding. 2) the Diverse Instruction-following data, which contains various instruction styles to enhance model's generalization. As a result, we construct 1 million instruction-following data, consisting of 344K Adversarial samples, 508K Diverse samples, and 165K benchmark training set samples. To better handle these complex instructions, Robin3D first incorporates Relation-Augmented Projector to enhance spatial understanding, and then strengthens the object referring and grounding ability through ID-Feature Bonding. Robin3D consistently outperforms previous methods across five widely-used 3D multimodal learning benchmarks, without the need for task-specific fine-tuning. Notably, we achieve a 7.8\% improvement in the grounding task (Multi3DRefer) and a 6.9\% improvement in the captioning task (Scan2Cap).

View on arXiv PDF Code

Similar