CVMay 19, 2024

Unifying 3D Vision-Language Understanding via Promptable Queries

arXiv:2405.11442v280 citationsh-index: 25ECCV
Originality Incremental advance
AI Analysis

This addresses the gap in unified models for 3D-VL tasks, benefiting researchers and practitioners in computer vision and robotics, though it is incremental as it builds on existing query-based methods.

The paper tackles the problem of unifying 3D vision-language understanding by introducing PQ3D, a model that uses promptable queries to handle various tasks, achieving state-of-the-art improvements such as 4.9% on ScanNet200 and 13.4% on Scan2Cap.

A unified model for 3D vision-language (3D-VL) understanding is expected to take various scene representations and perform a wide range of tasks in a 3D scene. However, a considerable gap exists between existing methods and such a unified model, due to the independent application of representation and insufficient exploration of 3D multi-task training. In this paper, we introduce PQ3D, a unified model capable of using Promptable Queries to tackle a wide range of 3D-VL tasks, from low-level instance segmentation to high-level reasoning and planning. This is achieved through three key innovations: (1) unifying various 3D scene representations (i.e., voxels, point clouds, multi-view images) into a shared 3D coordinate space by segment-level grouping, (2) an attention-based query decoder for task-specific information retrieval guided by prompts, and (3) universal output heads for different tasks to support multi-task training. Tested across ten diverse 3D-VL datasets, PQ3D demonstrates impressive performance on these tasks, setting new records on most benchmarks. Particularly, PQ3D improves the state-of-the-art on ScanNet200 by 4.9% (AP25), ScanRefer by 5.4% (acc@0.5), Multi3DRefer by 11.7% (F1@0.5), and Scan2Cap by 13.4% (CIDEr@0.5). Moreover, PQ3D supports flexible inference with individual or combined forms of available 3D representations, e.g., solely voxel input.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes