RO CV | Jan 9, 2025

Self-Supervised Representation Learning with Joint Embedding Predictive Architecture for Automotive LiDAR Object Detection

Haoran Zhu, Zhenyuan Dong, Kristi Topollai, Beiyao Sha, Anna Choromanska

arXiv:2501.04969v212.36 citations | h-index: 3 | Has Code

Originality: Highly original

AI Analysis

This work addresses efficient and effective self-supervised pre-training for autonomous driving systems. It offers a novel approach that improves downstream detection performance and reduces computational cost, though the contribution is incremental in that it applies existing JEPA concepts to a new domain.

The paper tackles self-supervised representation learning for automotive LiDAR object detection by introducing AD-L-JEPA, a pre-training framework that is neither generative nor contrastive. Instead, it uses a joint embedding predictive architecture to predict Bird's-Eye-View embeddings. The result is consistent improvement across the KITTI3D, Waymo, and ONCE datasets, including a 1.61 mAP gain on ONCE when pre-training on 100K frames, while reducing GPU hours by 1.9x-2.7x and GPU memory by 2.8x-4x compared with the state-of-the-art method Occupancy-MAE.

Recently, self-supervised representation learning relying on vast amounts of unlabeled data has been explored as a pre-training method for autonomous driving. However, directly applying popular contrastive or generative methods to this problem is insufficient and may even lead to negative transfer. In this paper, we present AD-L-JEPA, a novel self-supervised pre-training framework with a joint embedding predictive architecture (JEPA) for automotive LiDAR object detection. Unlike existing methods, AD-L-JEPA is neither generative nor contrastive. Instead of explicitly generating masked regions, our method predicts Bird's-Eye-View embeddings to capture the diverse nature of driving scenes. Furthermore, our approach eliminates the need to manually form contrastive pairs by employing explicit variance regularization to avoid representation collapse. Experimental results demonstrate consistent improvements on the LiDAR 3D object detection downstream task across the KITTI3D, Waymo, and ONCE datasets, while reducing GPU hours by 1.9x-2.7x and GPU memory by 2.8x-4x compared with the state-of-the-art method Occupancy-MAE. Notably, on the largest ONCE dataset, pre-training on 100K frames yields a 1.61 mAP gain, better than all other methods pre-trained on either 100K or 500K frames, and pre-training on 500K frames yields a 2.98 mAP gain, better than all other methods pre-trained on either 500K or 1M frames. AD-L-JEPA constitutes the first JEPA-based pre-training method for autonomous driving. It offers better quality, faster, and more GPU-memory-efficient self-supervised representation learning. The source code of AD-L-JEPA is ready to be released.
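
The abstract states that explicit variance regularization is used to avoid representation collapse, but it does not give the exact form of the regularizer. As an illustration only, the sketch below uses a common hinge-style penalty on the per-dimension standard deviation of a batch of embeddings (in the spirit of VICReg); the function name `variance_regularization` and the parameters `target_std` and `eps` are assumptions, not taken from the paper or its code.

```python
import math


def variance_regularization(embeddings, target_std=1.0, eps=1e-4):
    """Hinge penalty on the per-dimension standard deviation of a batch.

    embeddings: list of equal-length lists of floats (batch x dim),
                with at least two rows.
    Returns the mean over dimensions of max(0, target_std - std_d).
    NOTE: an assumed, generic form; not the paper's exact loss.
    """
    n = len(embeddings)
    dim = len(embeddings[0])
    penalty = 0.0
    for d in range(dim):
        column = [row[d] for row in embeddings]
        mean = sum(column) / n
        # Unbiased variance of this embedding dimension across the batch.
        var = sum((x - mean) ** 2 for x in column) / (n - 1)
        # eps keeps the square root stable when the variance is zero.
        std = math.sqrt(var + eps)
        # Penalize only dimensions whose spread falls below the target.
        penalty += max(0.0, target_std - std)
    return penalty / dim


# Collapsed batch: every embedding is identical, so the penalty is large.
collapsed = [[0.5, -0.2]] * 4
# Spread-out batch: every dimension varies well above the target.
spread = [[-2.0, 3.0], [2.0, -3.0], [-2.0, -3.0], [2.0, 3.0]]

print(variance_regularization(collapsed))
print(variance_regularization(spread))
```

A penalty of this kind is added to the embedding-prediction loss, so that an encoder mapping every input to the same vector is penalized without any hand-built contrastive pairs.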
