Image-to-Lidar Relational Distillation for Autonomous Driving Data
This work proposes a novel distillation approach that improves 3D model representations for autonomous driving in zero-shot and few-shot learning scenarios, though it is incremental relative to existing 2D-to-3D distillation frameworks.
The paper tackles the problem of distilling 2D foundation model representations into 3D models for autonomous driving datasets, where existing frameworks suffer from a structural mismatch between the 2D and 3D representations and from weak zero-shot or few-shot performance. The result is a relational distillation framework that improves 3D representation quality, outperforming contrastive distillation on zero-shot 3D semantic segmentation and similarity-based distillation on few-shot 3D semantic segmentation.
Pre-trained on extensive and diverse multi-modal datasets, 2D foundation models excel at addressing 2D tasks with little or no downstream supervision, owing to their robust representations. The emergence of 2D-to-3D distillation frameworks has extended these capabilities to 3D models. However, distilling 3D representations for autonomous driving datasets presents challenges like self-similarity, class imbalance, and point cloud sparsity, hindering the effectiveness of contrastive distillation, especially in zero-shot learning contexts. Whereas other methodologies, such as similarity-based distillation, enhance zero-shot performance, they tend to yield less discriminative representations, diminishing few-shot performance. We investigate the gap in structure between the 2D and the 3D representations that result from state-of-the-art distillation frameworks and reveal a significant mismatch between the two. Additionally, we demonstrate that the observed structural gap is negatively correlated with the efficacy of the distilled representations on zero-shot and few-shot 3D semantic segmentation. To bridge this gap, we propose a relational distillation framework enforcing intra-modal and cross-modal constraints, resulting in distilled 3D representations that closely capture the structure of the 2D representation. This alignment significantly enhances 3D representation performance over those learned through contrastive distillation in zero-shot segmentation tasks. Furthermore, our relational loss consistently improves the quality of 3D representations in both in-distribution and out-of-distribution few-shot segmentation tasks, outperforming approaches that rely on the similarity loss.
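To make the distinction between intra-modal and cross-modal constraints concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes paired per-point features: row i of `feats_3d` and row i of `feats_2d` describe the same lidar point and its corresponding pixel. The function names, the use of cosine-similarity matrices, and the squared-error form of both terms are illustrative assumptions; the abstract does not specify the exact losses.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def relational_losses(feats_3d, feats_2d):
    """Hypothetical relational terms on paired (N, D) feature matrices.

    intra: the 3D-to-3D similarity structure should match the 2D-to-2D one.
    cross: the 3D-to-2D similarity structure should match the 2D-to-2D one.
    Both forms are assumptions made for illustration.
    """
    z3 = l2_normalize(feats_3d)
    z2 = l2_normalize(feats_2d)
    sim_2d = z2 @ z2.T      # pairwise cosine similarities among 2D features
    sim_3d = z3 @ z3.T      # pairwise cosine similarities among 3D features
    sim_cross = z3 @ z2.T   # similarities between 3D and 2D features
    intra = float(np.mean((sim_3d - sim_2d) ** 2))
    cross = float(np.mean((sim_cross - sim_2d) ** 2))
    return intra, cross


def similarity_loss(feats_3d, feats_2d):
    """Pointwise baseline: mean (1 - cosine similarity) over matched pairs."""
    z3 = l2_normalize(feats_3d)
    z2 = l2_normalize(feats_2d)
    return float(np.mean(1.0 - np.sum(z3 * z2, axis=1)))


# Toy check: rotate the 2D features by a random orthogonal matrix.
# The pairwise structure is preserved, but the two feature spaces no longer align.
rng = np.random.default_rng(0)
feats_2d = rng.normal(size=(16, 8))
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
feats_3d = feats_2d @ rotation

intra, cross = relational_losses(feats_3d, feats_2d)
print("intra-modal term:", intra)
print("cross-modal term:", cross)
print("similarity loss: ", similarity_loss(feats_3d, feats_2d))
```

In this toy setup the intra-modal term is near zero because a rotation preserves all pairwise similarities, while the cross-modal term and the pointwise similarity loss stay clearly positive. This suggests why a framework might combine both kinds of constraint: the intra-modal term alone matches structure without tying the 3D features to the 2D feature space.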