CV · Dec 24, 2020

P4Contrast: Contrastive Learning with Pairs of Point-Pixel Pairs for RGB-D Scene Understanding

Yunze Liu, Li Yi, Shanghang Zhang, Qingnan Fan, Thomas Funkhouser, Hao Dong

arXiv:2012.13089v1 19.665 citations

Originality: Incremental advance

AI Analysis

This work provides a novel self-supervised pretraining method for RGB-D scene understanding, offering an incremental improvement for researchers working with multi-modal 3D data.

This paper addresses the lack of self-supervised contrastive learning for multi-modal RGB-D scans aimed at high-level scene understanding. The authors propose a method that contrasts 'pairs of point-pixel pairs', where positive pairs are corresponding RGB-D points and negative pairs involve disturbed modalities or non-corresponding points. This approach led to improved performance on ScanNet, SUN RGB-D, and 3RScan benchmarks compared to prior pretraining methods.

Self-supervised representation learning is a critical problem in computer vision, as it provides a way to pretrain feature extractors on large unlabeled datasets that can be used as an initialization for more efficient and effective training on downstream tasks. A promising approach is to use contrastive learning to learn a latent space where features are close for similar data samples and far apart for dissimilar ones. This approach has demonstrated tremendous success for pretraining both image and point cloud feature extractors, but it has been barely investigated for multi-modal RGB-D scans, especially with the goal of facilitating high-level scene understanding. To solve this problem, we propose contrasting "pairs of point-pixel pairs", where positives include pairs of RGB-D points in correspondence, and negatives include pairs where one of the two modalities has been disturbed and/or the two RGB-D points are not in correspondence. This provides extra flexibility in making hard negatives and helps networks to learn features from both modalities, not just the more discriminating one of the two. Experiments show that this proposed approach yields better performance on three large-scale RGB-D scene understanding benchmarks (ScanNet, SUN RGB-D, and 3RScan) than previous pretraining approaches.
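To make the "pairs of point-pixel pairs" idea concrete, here is a minimal NumPy sketch of an InfoNCE-style loss over fused point-pixel features. It is an illustration under stated assumptions, not the authors' implementation: the fusion by concatenation, the function names (`fuse`, `pair_of_pairs_info_nce`), the temperature value, and the way negatives are built by cyclically shifting one modality are all hypothetical choices. Row `i` of every input is assumed to describe the same 3D location seen in two views, `a` and `b`.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def fuse(point_feat, pixel_feat):
    """Hypothetical fusion of one point-pixel pair: concatenate, then normalize."""
    return l2_normalize(np.concatenate([point_feat, pixel_feat], axis=-1))


def pair_of_pairs_info_nce(pts_a, pix_a, pts_b, pix_b, temperature=0.1, rng=None):
    """InfoNCE-style loss over point-pixel pairs (illustrative assumption).

    Positives: pair i in view a versus pair i in view b (in correspondence).
    Negatives: (1) pairs j != i in view b (not in correspondence),
               (2) pair i in view b with its pixel feature swapped (disturbed),
               (3) pair i in view b with its point feature swapped (disturbed).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = pts_a.shape[0]
    if n < 2:
        raise ValueError("need at least two pairs to build negatives")

    anchor = fuse(pts_a, pix_a)    # (N, 2D)
    positive = fuse(pts_b, pix_b)  # (N, 2D)

    # A nonzero cyclic shift guarantees every index maps to a different index.
    shift = int(rng.integers(1, n))
    perm = (np.arange(n) + shift) % n
    disturbed_pixel = fuse(pts_b, pix_b[perm])  # point kept, pixel swapped
    disturbed_point = fuse(pts_b[perm], pix_b)  # pixel kept, point swapped

    # Diagonal entries are positives; off-diagonal entries are negatives.
    sim = anchor @ positive.T
    hard_pixel = np.sum(anchor * disturbed_pixel, axis=1, keepdims=True)
    hard_point = np.sum(anchor * disturbed_point, axis=1, keepdims=True)
    logits = np.concatenate([sim, hard_pixel, hard_point], axis=1) / temperature

    # Numerically stable cross-entropy with target column i for row i.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(n)
    return float(-log_prob[idx, idx].mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.normal(size=(8, 16))
    pix = rng.normal(size=(8, 16))
    print("loss:", pair_of_pairs_info_nce(pts, pix, pts, pix, rng=rng))
```

Because the disturbed negatives share one modality with the positive, a network that relies on only one modality cannot separate them from the positive, which is the motivation the abstract gives for this construction.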

