How do Cross-View and Cross-Modal Alignment Affect Representations in Contrastive Learning?
This work examines how alignment strategies affect self-supervised learning on multi-sensor data, offering insights for researchers in computer vision and robotics. Its contribution is incremental, as it builds on existing contrastive learning methods.
The study investigates how aligning representations across views and modalities in contrastive learning affects the visual features learned from images and point clouds. It finds that cross-modal alignment discards complementary information, such as color and texture, while emphasizing redundant depth cues. These depth cues improve downstream depth prediction, and cross-modal alignment yields more robust encoders than cross-view alignment.
Various state-of-the-art self-supervised visual representation learning approaches take advantage of data from multiple sensors by aligning the feature representations across views and/or modalities. In this work, we investigate how aligning representations affects the visual features obtained from cross-view and cross-modal contrastive learning on images and point clouds. Across five real-world datasets and five tasks, we train and evaluate 108 models based on four pretraining variations. We find that cross-modal representation alignment discards complementary visual information, such as color and texture, and instead emphasizes redundant depth cues. The depth cues obtained from pretraining improve downstream depth prediction performance. Overall, cross-modal alignment also leads to more robust encoders than pretraining by cross-view alignment, especially on depth prediction, instance segmentation, and object detection.
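
To make the notion of "aligning representations" concrete, the following is a minimal sketch of a contrastive alignment objective. The abstract does not state which loss is used, so the InfoNCE form, the temperature value, and all names here (`info_nce`, `image_emb`, `point_emb_aligned`, `point_emb_swapped`) are illustrative assumptions and not the authors' implementation. In a cross-view setup both inputs would be embeddings of two views of the same image; in a cross-modal setup one input would come from an image encoder and the other from a point cloud encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    Row i of `anchors` is pulled toward row i of `positives`; every other
    row of `positives` in the batch serves as a negative.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        # Temperature-scaled cosine similarity to every candidate.
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature
                  for cand in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


# Hypothetical toy embeddings (assumed values, for illustration only).
image_emb = [[1.0, 0.0], [0.0, 1.0]]
point_emb_aligned = [[0.9, 0.1], [0.1, 0.9]]   # pairs agree
point_emb_swapped = [[0.1, 0.9], [0.9, 0.1]]   # pairs disagree

loss_aligned = info_nce(image_emb, point_emb_aligned)
loss_swapped = info_nce(image_emb, point_emb_swapped)
```

Under this sketch, the loss is lower when paired embeddings agree than when they are swapped. Minimizing such a loss rewards only information that both inputs share, which is consistent with the finding that cross-modal alignment favors depth cues available in both modalities over color and texture, which are present only in the images.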