CV | Jan 12, 2023

Self-Supervised Image-to-Point Distillation via Semantically Tolerant Contrastive Loss

Anas Mahmoud, Jordan S. K. Hu, Tianshu Kuai, Ali Harakeh, Liam Paull, Steven L. Waslander

U of Toronto

arXiv:2301.05709v220.645 citations | h-index: 45 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses representation learning problems for 3D perception in autonomous driving, offering incremental improvements over existing methods.

The paper tackles the challenges of self-similarity and class imbalance in self-supervised image-to-point representation learning for autonomous driving, proposing a semantically tolerant contrastive loss and class-agnostic balanced loss that improve state-of-the-art 3D semantic segmentation performance across various pretrained models.

An effective framework for learning 3D representations for perception tasks is distilling rich self-supervised image features via contrastive learning. However, image-to-point representation learning for autonomous driving datasets faces two main challenges: 1) the abundance of self-similarity, which results in the contrastive losses pushing away semantically similar point and image regions and thus disturbing the local semantic structure of the learned representations, and 2) severe class imbalance, as pretraining is dominated by over-represented classes. We propose to alleviate the self-similarity problem through a novel semantically tolerant image-to-point contrastive loss that takes into consideration the semantic distance between positive and negative image regions to minimize contrasting semantically similar point and image regions. Additionally, we address class imbalance by designing a class-agnostic balanced loss that approximates the degree of class imbalance through an aggregate sample-to-samples semantic similarity measure. We demonstrate that our semantically tolerant contrastive loss with class balancing improves state-of-the-art 2D-to-3D representation learning in all evaluation settings on 3D semantic segmentation. Our method consistently outperforms state-of-the-art 2D-to-3D representation learning frameworks across a wide range of 2D self-supervised pretrained models.
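The two ideas in the abstract can be illustrated with a small NumPy sketch. It is not the authors' formulation: the function name `tolerant_balanced_loss`, the `1 - similarity` weighting of negatives, the inverse-density sample weights, and the temperature value are all assumptions chosen for illustration. The sketch assumes row `i` of the point features is paired with row `i` of the image-region features, and uses cosine similarity between image-region features as the semantic similarity measure.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def tolerant_balanced_loss(point_feats, image_feats, temperature=0.07):
    """Illustrative image-to-point contrastive loss (hypothetical helper).

    Returns (weighted loss, per-sample losses, per-sample weights).
    """
    p = l2_normalize(np.asarray(point_feats, dtype=np.float64))
    q = l2_normalize(np.asarray(image_feats, dtype=np.float64))

    # Point-to-image similarities; the diagonal holds the positive pairs.
    logits = p @ q.T / temperature

    # Semantic similarity between image regions, clipped to [0, 1].
    img_sim = np.clip(q @ q.T, 0.0, 1.0)

    # Assumed tolerance rule: a negative that is semantically close to the
    # positive region gets a small weight, so it is pushed away less.
    neg_weight = 1.0 - img_sim
    np.fill_diagonal(neg_weight, 1.0)  # positives keep full weight

    # Numerically stable weighted softmax denominator.
    shifted = np.exp(logits - logits.max(axis=1, keepdims=True))
    positives = np.diag(shifted)
    denominators = (neg_weight * shifted).sum(axis=1)
    per_sample = -np.log(positives / denominators)

    # Assumed balancing rule: a sample that is similar to many others is
    # treated as over-represented and down-weighted, without class labels.
    density = img_sim.sum(axis=1) + 1e-12
    weights = (1.0 / density) / (1.0 / density).sum()

    return float((weights * per_sample).sum()), per_sample, weights


# Two near-duplicate regions (e.g. road) and one rare region.
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
loss, per_sample, weights = tolerant_balanced_loss(feats, feats)
print(loss, weights)
```

In this toy input the two duplicate regions do not act as negatives for each other, and the rare region receives the largest balancing weight, which is the qualitative behavior the abstract describes.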
