CV, LG, IV | Dec 3, 2020

Scan2Cap: Context-aware Dense Captioning in RGB-D Scans

arXiv:2012.02206v1 | 261 citations
AI Analysis

This work addresses the problem of automatically describing 3D objects in scanned scenes for applications like robotics and augmented reality, offering a substantial improvement over existing 2D methods.

This paper introduces the task of dense captioning in 3D scans: given a 3D point cloud, the goal is to output a bounding box and a natural language description for each object. The proposed Scan2Cap method detects and describes 3D objects, outperforming 2D baselines by 27.61% in CIDEr@0.5IoU.

We introduce the task of dense captioning in 3D scans from commodity RGB-D sensors. As input, we assume a point cloud of a 3D scene; the expected output is the bounding boxes along with the descriptions for the underlying objects. To address the 3D object detection and description problems, we propose Scan2Cap, an end-to-end trained method, to detect objects in the input scene and describe them in natural language. We use an attention mechanism that generates descriptive tokens while referring to the related components in the local context. To reflect object relations (i.e., relative spatial relations) in the generated captions, we use a message passing graph module to facilitate learning object relation features. Our method can effectively localize and describe 3D objects in scenes from the ScanRefer dataset, outperforming 2D baseline methods by a significant margin (27.61% CIDEr@0.5IoU improvement).
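The abstract reports its gain in CIDEr@0.5IoU without spelling out how a captioning score is combined with a box-overlap threshold. The sketch below shows one plausible reading: a caption's score counts only if its predicted box overlaps the ground-truth box with IoU at or above the threshold, and the gated scores are averaged over all objects. It assumes axis-aligned boxes, a one-to-one matching between predictions and ground truth, and per-caption scores that have already been computed by some captioning metric. The names `box_iou_3d` and `score_at_iou` are hypothetical and are not taken from the Scan2Cap code.

```python
from typing import Sequence, Tuple

# Axis-aligned 3D box: (xmin, ymin, zmin, xmax, ymax, zmax). Assumed layout.
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned box; zero if any side is degenerate."""
    sides = [max(0.0, box[axis + 3] - box[axis]) for axis in range(3)]
    return sides[0] * sides[1] * sides[2]


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        low = max(a[axis], b[axis])
        high = min(a[axis + 3], b[axis + 3])
        if high <= low:
            return 0.0  # no overlap along this axis
        intersection *= high - low
    union = box_volume(a) + box_volume(b) - intersection
    return intersection / union if union > 0 else 0.0


def score_at_iou(
    caption_scores: Sequence[float],
    predicted_boxes: Sequence[Box],
    ground_truth_boxes: Sequence[Box],
    threshold: float = 0.5,
) -> float:
    """Average caption score, counting a caption only when its box
    reaches the IoU threshold. Poorly localized captions contribute 0."""
    if not caption_scores:
        return 0.0
    total = 0.0
    for score, predicted, truth in zip(
        caption_scores, predicted_boxes, ground_truth_boxes
    ):
        if box_iou_3d(predicted, truth) >= threshold:
            total += score
    return total / len(caption_scores)


if __name__ == "__main__":
    unit = (0.0, 0.0, 0.0, 1.0, 1.0, 1.0)
    shifted = (0.5, 0.0, 0.0, 1.5, 1.0, 1.0)  # overlaps half of the unit box
    # First caption is well localized, second is not (IoU = 1/3 < 0.5).
    print(score_at_iou([1.0, 0.8], [unit, unit], [unit, shifted]))
```

Under this reading, a method is rewarded only for captions attached to well-localized boxes, so the metric measures detection and description jointly rather than caption quality alone.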
