CV · Mar 13, 2024

CLIP-BEVFormer: Enhancing Multi-View Image-Based BEV Detector with Ground Truth Flow

Chenbin Pan, Burhaneddin Yaman, Senem Velipasalar, Liu Ren

arXiv:2403.08919v2 · 16.430 citations · h-index: 34 · CVPR

Originality: Incremental advance

AI Analysis

This work improves 3D object detection accuracy for autonomous driving, an incremental advance in BEV-based perception systems.

The paper tackles the loss of clear supervision for Bird's Eye View (BEV) elements in autonomous driving. It introduces CLIP-BEVFormer, which uses contrastive learning to enhance multi-view image-based BEV detectors with ground truth flow. On the nuScenes dataset, the authors report improvements of 8.5% in NDS and 9.2% in mAP over the previous state of the art.

Autonomous driving stands as a pivotal domain in computer vision, shaping the future of transportation. Within this paradigm, the backbone of the system plays a crucial role in interpreting the complex environment. However, a notable challenge has been the loss of clear supervision when it comes to Bird's Eye View elements. To address this limitation, we introduce CLIP-BEVFormer, a novel approach that leverages the power of contrastive learning techniques to enhance the multi-view image-derived BEV backbones with ground truth information flow. We conduct extensive experiments on the challenging nuScenes dataset and showcase significant and consistent improvements over the SOTA. Specifically, CLIP-BEVFormer achieves an impressive 8.5% and 9.2% enhancement in terms of NDS and mAP, respectively, over the previous best BEV model on the 3D object detection task.
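The abstract does not spell out the loss, so the following is only a minimal sketch of a generic CLIP-style symmetric contrastive (InfoNCE) objective, the kind of technique the name suggests. It is not the authors' implementation. The function names, the pairing of one BEV embedding with one ground-truth embedding per sample, and the temperature of 0.07 are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def symmetric_contrastive_loss(bev_embeds, gt_embeds, temperature=0.07):
    """Hypothetical CLIP-style symmetric InfoNCE loss over N matched pairs.

    Assumption: bev_embeds[i] and gt_embeds[i] describe the same sample, and
    every other pairing in the batch is treated as a negative.
    """
    if len(bev_embeds) != len(gt_embeds) or not bev_embeds:
        raise ValueError("need equal, non-empty batches")
    bev = [l2_normalize(v) for v in bev_embeds]
    gt = [l2_normalize(v) for v in gt_embeds]
    n = len(bev)
    # Cosine-similarity logits, scaled by the (assumed) temperature.
    logits = [[sum(a * b for a, b in zip(bev[i], gt[j])) / temperature
               for j in range(n)] for i in range(n)]
    # BEV -> GT direction: row i should select column i.
    loss_b2g = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # GT -> BEV direction: column j should select row j.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_g2b = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_b2g + loss_g2b)
```

Under these assumptions, the loss is near zero when each BEV embedding points in the same direction as its own ground-truth embedding, and it grows when the pairs are mismatched. That is the signal a contrastive objective would use to pull image-derived BEV features toward ground-truth information.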
