CLIP-BEVFormer: Enhancing Multi-View Image-Based BEV Detector with Ground Truth Flow
This work targets 3D object detection accuracy, a key problem in autonomous driving, and represents an incremental advance in BEV-based perception systems.
The paper tackles the loss of clear supervision for Bird's Eye View (BEV) elements in autonomous driving. It introduces CLIP-BEVFormer, which uses contrastive learning to enhance multi-view image-based BEV detectors with ground truth information flow, and reports improvements of 8.5% in NDS and 9.2% in mAP over the previous best BEV model on the nuScenes 3D object detection task.
Autonomous driving stands as a pivotal domain in computer vision, shaping the future of transportation. Within this paradigm, the backbone of the system plays a crucial role in interpreting the complex environment. However, a notable challenge has been the loss of clear supervision when it comes to Bird's Eye View elements. To address this limitation, we introduce CLIP-BEVFormer, a novel approach that leverages the power of contrastive learning techniques to enhance the multi-view image-derived BEV backbones with ground truth information flow. We conduct extensive experiments on the challenging nuScenes dataset and showcase significant and consistent improvements over the SOTA. Specifically, CLIP-BEVFormer achieves an impressive 8.5% and 9.2% enhancement in terms of NDS and mAP, respectively, over the previous best BEV model on the 3D object detection task.
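The abstract does not spell out the contrastive objective, so the following is only a minimal sketch of a generic CLIP-style symmetric contrastive (InfoNCE) loss, assuming each BEV embedding is paired with one ground-truth embedding of the same dimension. The function names, the pairing scheme, and the temperature of 0.07 are illustrative assumptions, not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(bev_embeds, gt_embeds, temperature=0.07):
    """Hypothetical CLIP-style loss: pair i of bev_embeds matches pair i of gt_embeds.

    All names and the default temperature are assumptions for illustration.
    """
    if len(bev_embeds) != len(gt_embeds) or not bev_embeds:
        raise ValueError("need equal, non-empty lists of paired embeddings")
    bev = [l2_normalize(v) for v in bev_embeds]
    gt = [l2_normalize(v) for v in gt_embeds]
    # Temperature-scaled cosine similarity of every BEV/ground-truth pair.
    logits = [
        [sum(a * b for a, b in zip(u, v)) / temperature for v in gt]
        for u in bev
    ]
    # Transpose to score the ground-truth-to-BEV direction as well.
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))
```

Under these assumptions, the loss is small when each BEV embedding is most similar to its own ground-truth embedding and large when the pairs are mismatched, which is the sense in which ground truth information can supervise the BEV representation.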