CV | Jul 8, 2024

3D Vision and Language Pretraining with Large-Scale Synthetic Data

arXiv:2407.06084v1 | 21 citations | h-index: 21
Originality: Incremental advance
AI Analysis

This work addresses data scarcity for researchers in embodied AI, though the advance is incremental: it uses synthetic data to augment existing methods.

The paper tackles the limited diversity and annotations in 3D vision-language pre-training by constructing SynVL3D, a synthetic dataset with 10K scenes and 1M descriptions, and achieves state-of-the-art performance on tasks like visual grounding and question answering.

3D Vision-Language Pre-training (3D-VLP) aims to provide a pre-trained model that bridges 3D scenes with natural language, an important technique for embodied intelligence. However, current 3D-VLP datasets are hindered by limited scene-level diversity and insufficient fine-grained annotations (only 1.2K scenes and 280K textual annotations in ScanScribe), primarily because collecting and annotating 3D scenes is labor-intensive. To overcome these obstacles, we construct SynVL3D, a comprehensive synthetic scene-text corpus with 10K indoor scenes and 1M descriptions at object, view, and room levels, which has the advantages of diverse scene data, rich textual descriptions, multi-grained 3D-text associations, and low collection cost. Utilizing the rich annotations in SynVL3D, we pre-train a simple and unified Transformer for aligning 3D and language with multi-grained pre-training tasks. Moreover, we propose synthetic-to-real domain adaptation in the downstream fine-tuning process to address the domain shift. Through extensive experiments, we verify the effectiveness of our model design by achieving state-of-the-art performance on downstream tasks including visual grounding, dense captioning, and question answering.

Code Implementations: 1 repo
