Cocoon: Robust Multi-Modal Perception with Uncertainty-Aware Sensor Fusion
This work improves object detection for autonomous systems by explicitly handling uncertainty in sensor data. The contribution is incremental, since it builds on existing adaptive fusion approaches.
The paper addresses robust 3D object detection in multi-modal perception by quantifying uncertainty during sensor fusion. The resulting framework is reported to consistently outperform existing methods in both normal and challenging conditions.
An important paradigm in 3D object detection is the use of multiple modalities to enhance accuracy in both normal and challenging conditions, particularly for long-tail scenarios. To address this, recent studies have explored two directions of adaptive approaches: MoE-based adaptive fusion, which struggles with uncertainties arising from distinct object configurations, and late fusion for output-level adaptive fusion, which relies on separate detection pipelines and limits comprehensive understanding. In this work, we introduce Cocoon, an object- and feature-level uncertainty-aware fusion framework. The key innovation lies in uncertainty quantification for heterogeneous representations, enabling fair comparison across modalities through the introduction of a feature aligner and a learnable surrogate ground truth, termed feature impression. We also define a training objective to ensure that their relationship provides a valid metric for uncertainty quantification. Cocoon consistently outperforms existing static and adaptive methods in both normal and challenging conditions, including those with natural and artificial corruptions. Furthermore, we show the validity and efficacy of our uncertainty metric across diverse datasets.
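To make the idea of comparing heterogeneous features against a shared reference more concrete, here is a minimal toy sketch. It is not the paper's implementation: the linear aligner, the Euclidean distance used as an uncertainty score, the softmax weighting, and every name (`align`, `uncertainty`, `fuse`, `impression`, the toy matrices) are assumptions made purely for illustration. The abstract does not specify how the aligner, the feature impression, or the training objective are actually defined.

```python
import math

def align(feature, weights):
    """Project a modality-specific feature into a shared space.
    Hypothetical stand-in for a feature aligner (a plain linear map)."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]

def uncertainty(aligned, impression):
    """Assumed uncertainty score: Euclidean distance between the aligned
    feature and the reference vector. Larger means less trustworthy."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(aligned, impression)))

def fuse(aligned_feats, uncertainties, temperature=1.0):
    """Weight each modality by a softmax over negative uncertainty,
    then return the weighted sum together with the weights."""
    logits = [-u / temperature for u in uncertainties]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(aligned_feats[0])
    fused = [sum(w * f[i] for w, f in zip(weights, aligned_feats))
             for i in range(dim)]
    return fused, weights

# Toy values, all made up. In a real system the reference vector and the
# aligners would be learned rather than fixed by hand.
impression = [1.0, 0.0]
camera_aligner = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # maps 3-D to 2-D
lidar_aligner = [[0.5, 0.0], [0.0, 0.5]]              # maps 2-D to 2-D

camera_feat = [0.9, 0.1, 0.3]  # close to the reference after alignment
lidar_feat = [0.0, 4.0]        # simulates a corrupted reading

cam = align(camera_feat, camera_aligner)
lid = align(lidar_feat, lidar_aligner)
scores = [uncertainty(cam, impression), uncertainty(lid, impression)]
fused, weights = fuse([cam, lid], scores)
print(weights, fused)
```

In this toy setup the corrupted LiDAR feature lands farther from the reference vector, so it receives the smaller fusion weight. Because both modalities are scored against the same reference in the same space, their scores are directly comparable, which is the property the abstract attributes to the aligner and feature impression.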