CV LG IV

Dec 3, 2020

Scan2Cap: Context-aware Dense Captioning in RGB-D Scans

Dave Zhenyu Chen, Ali Gholami, Matthias Nießner, Angel X. Chang

arXiv:2012.02206v132.6263 citations

Originality: Highly original

AI Analysis

This work addresses the problem of automatically describing 3D objects in scanned scenes for applications like robotics and augmented reality, offering a substantial improvement over existing 2D methods.

This paper introduces the task of dense captioning in 3D scans: given a 3D point cloud of a scene, the goal is to output a bounding box and a natural language description for each object. The proposed Scan2Cap method detects and describes 3D objects end to end, outperforming 2D baselines by 27.61% in CIDEr@0.5IoU.

We introduce the task of dense captioning in 3D scans from commodity RGB-D sensors. As input, we assume a point cloud of a 3D scene; the expected output is the bounding boxes along with the descriptions for the underlying objects. To address the 3D object detection and description problems, we propose Scan2Cap, an end-to-end trained method, to detect objects in the input scene and describe them in natural language. We use an attention mechanism that generates descriptive tokens while referring to the related components in the local context. To reflect object relations (i.e., relative spatial relations) in the generated captions, we use a message passing graph module to facilitate learning object relation features. Our method can effectively localize and describe 3D objects in scenes from the ScanRefer dataset, outperforming 2D baseline methods by a significant margin (27.61% CIDEr@0.5IoU improvement).
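The reported metric, CIDEr@0.5IoU, combines caption quality with localization quality. The sketch below shows one plausible reading of such a metric, not the authors' evaluation code: a caption's score counts only when its predicted box overlaps the ground-truth box with IoU at or above the threshold, and the sum is averaged over all ground-truth objects. The function names, the axis-aligned box format, the one-to-one pairing of predictions with ground truth, and the use of `>=` at the threshold are all assumptions made for illustration; the per-caption scores are taken as given (for example, precomputed CIDEr values).

```python
from typing import Sequence, Tuple

# Hypothetical box format: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned 3D box."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0


def score_at_iou(
    caption_scores: Sequence[float],
    pred_boxes: Sequence[Box],
    gt_boxes: Sequence[Box],
    threshold: float = 0.5,
) -> float:
    """Average caption score over ground-truth objects, counting a caption
    only if its predicted box reaches the IoU threshold (assumed pairing:
    the i-th prediction corresponds to the i-th ground-truth object)."""
    if not gt_boxes:
        return 0.0
    total = 0.0
    for score, pred, gt in zip(caption_scores, pred_boxes, gt_boxes):
        if box_iou_3d(pred, gt) >= threshold:
            total += score
    return total / len(gt_boxes)
```

Under this reading, a well-worded caption attached to a badly localized box contributes nothing, so the metric rewards detection and description jointly.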
