CVJul 8, 2024

3D Vision and Language Pretraining with Large-Scale Synthetic Data

Dejie Yang, Zhu Xu, Wentao Mo, Qingchao Chen, Siyuan Huang, Yang Liu

arXiv:2407.06084v118.221 citationsh-index: 21Has Code

Originality Incremental advance

AI Analysis

This addresses the problem of data scarcity for researchers in embodied AI, though it is incremental by using synthetic data to augment existing methods.

The paper tackles the limited diversity and annotations in 3D vision-language pre-training by constructing SynVL3D, a synthetic dataset with 10K scenes and 1M descriptions, and achieves state-of-the-art performance on tasks like visual grounding and question answering.

3D Vision-Language Pre-training (3D-VLP) aims to provide a pre-train model which can bridge 3D scenes with natural language, which is an important technique for embodied intelligence. However, current 3D-VLP datasets are hindered by limited scene-level diversity and insufficient fine-grained annotations (only 1.2K scenes and 280K textual annotations in ScanScribe), primarily due to the labor-intensive of collecting and annotating 3D scenes. To overcome these obstacles, we construct SynVL3D, a comprehensive synthetic scene-text corpus with 10K indoor scenes and 1M descriptions at object, view, and room levels, which has the advantages of diverse scene data, rich textual descriptions, multi-grained 3D-text associations, and low collection cost. Utilizing the rich annotations in SynVL3D, we pre-train a simple and unified Transformer for aligning 3D and language with multi-grained pretraining tasks. Moreover, we propose a synthetic-to-real domain adaptation in downstream task fine-tuning process to address the domain shift. Through extensive experiments, we verify the effectiveness of our model design by achieving state-of-the-art performance on downstream tasks including visual grounding, dense captioning, and question answering.

View on arXiv PDF Code

Similar