Learned Fusion: 3D Object Detection using Calibration-Free Transformer Feature Fusion
This work addresses the challenge of maintaining sensor calibration in large-scale deployments of autonomous systems, and it represents a novel direction rather than an incremental improvement.
The paper tackles the problem of 3D object detection's reliance on sensor calibration by introducing the first calibration-free approach using transformers to fuse features across sensors, achieving a 14.1% improvement in BEV mAP over single-modal setups.
The state of the art in 3D object detection using sensor fusion relies heavily on calibration quality, which is difficult to maintain in large-scale deployments outside a lab environment. We present the first calibration-free approach for 3D object detection, thus eliminating the need for complex and costly calibration procedures. Our approach uses transformers to map features between multiple views of different sensors at multiple abstraction levels. In an extensive evaluation for object detection, we show not only that our approach outperforms single-modal setups by 14.1% in BEV mAP, but also that the transformer indeed learns the mapping. By showing that calibration is not necessary for sensor fusion, we hope to motivate other researchers to pursue calibration-free fusion. Additionally, the resulting approaches are substantially resilient to rotation and translation changes.
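To make the idea of mapping features between sensors without calibration more concrete, the following is a minimal sketch of single-head cross-attention, the generic transformer mechanism such a fusion could build on. It is not the architecture from the paper: the function names, the toy feature vectors, the omission of learned query/key/value projections, and the fusion by concatenation are all assumptions made for illustration. The point it shows is that the correspondence between tokens of one sensor and tokens of another comes from feature similarity alone, with no extrinsic or projection matrix involved.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: feature tokens from sensor A, shape [n_q][d]
    keys, values: feature tokens from sensor B, shapes [n_k][d] and [n_k][d_v]
    Returns (attended, weights). No calibration data is used anywhere.
    Assumption: learned projections are left out to keep the sketch short.
    """
    d = len(queries[0])
    attended, weights = [], []
    for q in queries:
        # Similarity of this query to every token of the other sensor.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the other sensor's value vectors.
        attended.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return attended, weights


def fuse(queries, attended):
    """Hypothetical fusion step: concatenate each token with its attended feature."""
    return [q + a for q, a in zip(queries, attended)]


# Toy features (hypothetical values): two lidar tokens, three camera tokens.
lidar_tokens = [[4.0, 0.0], [0.0, 4.0]]
camera_tokens = [[0.0, 4.0], [4.0, 0.0], [0.0, 0.0]]

attended, weights = cross_attention(lidar_tokens, camera_tokens, camera_tokens)
fused = fuse(lidar_tokens, attended)

# Index of the camera token each lidar token attends to most strongly.
best_match = [row.index(max(row)) for row in weights]
```

In this toy setup each lidar token attends most strongly to the camera token with the most similar feature vector, which stands in for the learned mapping described above. A multi-level variant would apply the same operation to feature maps taken from several depths of each sensor's backbone.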