CV Mar 2, 2022

X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning

Zhihao Yuan, Xu Yan, Yinghong Liao, Yao Guo, Guanbin Li, Zhen Li, Shuguang Cui

arXiv:2203.00843v3 26.1111 citations h-index: 33 Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of generating accurate descriptions of objects in 3D scenes, which matters for applications such as robotics and augmented reality. Its method needs only point clouds at inference, so it avoids the extra computational cost of processing 2D images at that stage. The advance is incremental because it builds on existing knowledge distillation techniques.

The paper tackles 3D dense captioning, where single-modal approaches that use only point clouds often produce unfaithful descriptions. It proposes X-Trans2Cap, which uses cross-modal knowledge transfer in a teacher-student framework to boost performance, improving absolute CIDEr scores by about +21 on ScanRefer and about +16 on Nr3D.

3D dense captioning aims to describe individual objects in 3D scenes in natural language, where 3D scenes are usually represented as RGB-D scans or point clouds. However, by exploiting only single-modal information, e.g., point clouds, previous approaches fail to produce faithful descriptions. Though aggregating 2D features into point clouds may be beneficial, it introduces an extra computational burden, especially in the inference phase. In this study, we investigate cross-modal knowledge transfer using Transformer for 3D dense captioning, X-Trans2Cap, to effectively boost the performance of single-modal 3D captioning through knowledge distillation using a teacher-student framework. In practice, during the training phase, the teacher network exploits the auxiliary 2D modality and guides the student network, which takes only point clouds as input, through feature consistency constraints. Owing to the well-designed cross-modal feature fusion module and the feature alignment in the training phase, X-Trans2Cap acquires rich appearance information embedded in 2D images with ease. Thus, a more faithful caption can be generated using only point clouds during inference. Qualitative and quantitative results confirm that X-Trans2Cap outperforms the previous state of the art by a large margin, i.e., about +21 and about +16 absolute CIDEr score on ScanRefer and Nr3D datasets, respectively.
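
The teacher-student training described in the abstract can be illustrated with a minimal sketch. The abstract only says the student is guided "through feature consistency constraints"; the mean squared error used here, the `weight` parameter, and all function names are assumptions made for illustration, not the paper's actual formulation. The idea shown is that the student's training loss combines an ordinary captioning loss with a term that pulls its point-cloud features toward the teacher's 2D-enriched features.

```python
import math


def feature_consistency(student_feats, teacher_feats):
    """Mean squared error between student and teacher object features.

    Assumption: MSE stands in for the paper's feature consistency constraint.
    Both arguments are lists of equal-length feature vectors (lists of floats).
    """
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student_feats, teacher_feats):
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count


def cross_entropy(logits, target_index):
    """Negative log-likelihood of one target word, via a stable log-softmax."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target_index]


def student_loss(student_feats, teacher_feats, word_logits, target_words, weight=1.0):
    """Captioning loss plus a weighted feature-alignment term (hypothetical).

    The teacher features are treated as fixed targets here; only the student
    would be updated from this loss.
    """
    caption = sum(
        cross_entropy(logits, word) for logits, word in zip(word_logits, target_words)
    ) / len(target_words)
    return caption + weight * feature_consistency(student_feats, teacher_feats)
```

At inference the teacher and the alignment term are dropped, so the student runs on point clouds alone, which is why no 2D processing cost is incurred at that stage.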
