CV Aug 13, 2024

FlatFusion: Delving into Details of Sparse Transformer-based Camera-LiDAR Fusion for Autonomous Driving

Yutao Zhu, Xiaosong Jia, Xinyu Yang, Junchi Yan

arXiv:2408.06832v211.312 citations h-index: 22

Originality: Incremental advance

AI Analysis

The paper addresses camera-LiDAR sensor fusion for autonomous driving, but the advance is incremental because it builds on prior sparse Transformer methods.

The paper tackles the problem of effectively fusing camera and LiDAR data in autonomous driving by exploring design choices for Transformer-based sparse fusion. The resulting framework, FlatFusion, achieves 73.7 NDS on the nuScenes validation set at 10.1 FPS (PyTorch), outperforming state-of-the-art sparse Transformer-based methods including UniTR, CMT, and SparseFusion.

The integration of data from diverse sensor modalities (e.g., camera and LiDAR) constitutes a prevalent methodology within the ambit of autonomous driving scenarios. Recent advancements in efficient point cloud transformers have underscored the efficacy of integrating information in sparse formats. When it comes to fusion, image patches are dense in pixel space with ambiguous depth, so effective fusion necessitates additional design considerations. In this paper, we conduct a comprehensive exploration of design choices for Transformer-based sparse camera-LiDAR fusion. This investigation encompasses strategies for image-to-3D and LiDAR-to-2D mapping, attention neighbor grouping, single modal tokenizer, and micro-structure of Transformer. By amalgamating the most effective principles uncovered through our investigation, we introduce FlatFusion, a carefully designed framework for sparse camera-LiDAR fusion. Notably, FlatFusion significantly outperforms state-of-the-art sparse Transformer-based methods, including UniTR, CMT, and SparseFusion, achieving 73.7 NDS on the nuScenes validation set with 10.1 FPS with PyTorch.
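For readers unfamiliar with the LiDAR-to-2D mapping mentioned in the abstract, the following is a minimal sketch of the generic idea: projecting 3D LiDAR points into an image with a pinhole camera model. It is not FlatFusion's implementation. The function name, the argument layout, and the calibration values are all hypothetical, and the camera frame is assumed to have its z axis pointing forward.

```python
def project_lidar_to_image(points, extrinsic, intrinsic, width, height):
    """Project 3D LiDAR points to pixel coordinates (generic pinhole model).

    points:    list of (x, y, z) tuples in the LiDAR frame.
    extrinsic: 3x4 nested list [R | t] mapping LiDAR frame to camera frame
               (hypothetical calibration, z axis assumed to point forward).
    intrinsic: (fx, fy, cx, cy) focal lengths and principal point in pixels.
    Returns a list of (point_index, u, v, depth) for points that lie in
    front of the camera and land inside the image bounds.
    """
    fx, fy, cx, cy = intrinsic
    hits = []
    for i, (x, y, z) in enumerate(points):
        # Rigid transform into the camera frame: p_cam = R @ p + t.
        xc, yc, zc = (
            row[0] * x + row[1] * y + row[2] * z + row[3] for row in extrinsic
        )
        if zc <= 0:
            continue  # point is behind the camera plane
        # Perspective division followed by the intrinsic mapping.
        u = fx * xc / zc + cx
        v = fy * yc / zc + cy
        if 0 <= u < width and 0 <= v < height:
            hits.append((i, u, v, zc))
    return hits


# Illustrative values only: identity extrinsic and made-up intrinsics.
IDENTITY = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
demo_points = [(0.0, 0.0, 10.0), (1.0, 0.0, 10.0), (0.0, 0.0, -5.0), (100.0, 0.0, 10.0)]
demo_hits = project_lidar_to_image(
    demo_points, IDENTITY, (1000.0, 1000.0, 800.0, 450.0), 1600, 900
)
```

In this toy call, the third point is dropped because it lies behind the camera and the fourth because it projects outside the image. The reverse direction (image-to-3D) is harder because each pixel has ambiguous depth, which is the difficulty the abstract points to.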
