Pic@Point: Cross-Modal Learning by Local and Global Point-Picture Correspondence
This work targets computer vision researchers and practitioners, for whom self-supervised learning has so far brought limited advances on 3D data. Its contribution appears incremental, since it builds on existing contrastive learning ideas.
The paper tackles self-supervised pre-training for 3D point clouds by introducing Pic@Point, a contrastive learning method that uses 2D-3D correspondences to guide point cloud representations. The authors report that it outperforms state-of-the-art pre-training methods on several 3D benchmarks.
Self-supervised pre-training has achieved remarkable success in NLP and 2D vision. However, these advances have yet to translate to 3D data. Techniques like masked reconstruction face inherent challenges on unstructured point clouds, while many contrastive learning tasks lack complexity and informative value. In this paper, we present Pic@Point, an effective contrastive learning method based on structural 2D-3D correspondences. We leverage image cues rich in semantic and contextual knowledge to provide a guiding signal for point cloud representations at various abstraction levels. Our lightweight approach outperforms state-of-the-art pre-training methods on several 3D benchmarks.
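To make the idea of contrasting point and picture features at a local and a global level concrete, the following is a minimal sketch, not the method as published. The abstract does not state the loss, so the InfoNCE form, the temperature of 0.07, the equal weighting of the two levels, and all function and parameter names (`info_nce`, `local_global_loss`, `point_feats`, `pixel_feats`, `cloud_feats`, `image_feats`) are assumptions made purely for illustration.

```python
import math


def l2_normalize(v):
    """Scale a feature vector to unit length (zero vectors are left as-is)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.07):
    """Mean InfoNCE loss where anchors[i] is paired with positives[i].

    All other rows of `positives` act as negatives for anchors[i].
    The temperature value is an assumed default, not taken from the paper.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def local_global_loss(point_feats, pixel_feats, cloud_feats, image_feats,
                      weight=0.5):
    """Combine a local and a global contrastive term (assumed weighting).

    point_feats[i] / pixel_feats[i]: features of a 3D point and the 2D pixel
        it corresponds to (local level).
    cloud_feats[k] / image_feats[k]: pooled features of a whole point cloud
        and its picture (global level).
    """
    local_term = info_nce(point_feats, pixel_feats)
    global_term = info_nce(cloud_feats, image_feats)
    return weight * local_term + (1.0 - weight) * global_term


# Toy usage: matched pairs should score a lower loss than mismatched ones.
points = [[1.0, 0.0], [0.0, 1.0]]
pixels_matched = [[1.0, 0.0], [0.0, 1.0]]
pixels_shuffled = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = info_nce(points, pixels_matched)
loss_shuffled = info_nce(points, pixels_shuffled)
```

In this toy setup the matched pairing yields a loss close to zero while the shuffled pairing yields a much larger one, which is the behaviour a correspondence-based guiding signal relies on. In a real pipeline the feature lists would come from a point cloud encoder and an image encoder, both of which are outside the scope of this sketch.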