CV LG Apr 6

Multi-Modal Sensor Fusion using Hybrid Attention for Autonomous Driving

arXiv:2604.04797 22.8
AI Analysis

This paper addresses accurate perception for autonomous vehicles, but its contribution is incremental: it builds on existing methods such as BEVDepth and RadarBEVNet.

The paper tackles 3D object detection for autonomous driving by proposing MMF-BEV, a radar-camera fusion framework that outperforms unimodal baselines and achieves competitive results against prior fusion methods on the VoD dataset.

Accurate 3D object detection for autonomous driving requires complementary sensors. Cameras provide dense semantics but unreliable depth, while millimeter-wave radar offers precise range and velocity measurements with sparse geometry. We propose MMF-BEV, a radar-camera BEV fusion framework that leverages deformable attention for cross-modal feature alignment on the View-of-Delft (VoD) 4D radar dataset [1]. MMF-BEV builds a BEVDepth [2] camera branch and a RadarBEVNet [3] radar branch, each enhanced with Deformable Self-Attention, and fuses them via a Deformable Cross-Attention module. We evaluate three configurations: camera-only, radar-only, and hybrid fusion. A sensor contribution analysis quantifies per-distance modality weighting, providing interpretable evidence of sensor complementarity. A two-stage training strategy, pre-training the camera branch with depth supervision and then jointly training the radar and fusion modules, stabilizes learning. Experiments on VoD show that MMF-BEV consistently outperforms unimodal baselines and achieves competitive results against prior fusion methods across all object classes in both the full annotated area and near-range Region of Interest.
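
As a rough illustration of the cross-modal alignment idea in the abstract, the sketch below shows only the sampling-and-aggregation step of deformable attention for a single camera-BEV query reading from a radar BEV map. It is not the paper's implementation. The names `bilinear_sample` and `deformable_cross_attention` are hypothetical, the feature map is reduced to one channel, and the sampling offsets and attention scores are passed in directly. In a full deformable attention layer they would be predicted from the query by learned projections and spread across multiple heads.

```python
import math
from typing import List, Sequence, Tuple

# Single-channel BEV feature map indexed as grid[row][col] (assumption: one channel
# keeps the sketch short; real BEV features are multi-channel tensors).
Grid = List[List[float]]


def bilinear_sample(grid: Grid, x: float, y: float) -> float:
    """Bilinearly interpolate `grid` at continuous (x=col, y=row).

    Neighbours that fall outside the map contribute zero.
    """
    height, width = len(grid), len(grid[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= yi < height and 0 <= xi < width:
                weight = (1.0 - abs(x - xi)) * (1.0 - abs(y - yi))
                value += weight * grid[yi][xi]
    return value


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the K sampling points."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def deformable_cross_attention(
    radar_bev: Grid,
    ref_point: Tuple[float, float],
    offsets: Sequence[Tuple[float, float]],
    scores: Sequence[float],
) -> float:
    """Aggregate radar features for one camera-BEV query.

    `ref_point` is the query's (x, y) location in the radar map.
    `offsets` and `scores` stand in for what a learned layer would predict
    from the query feature (hypothetical inputs in this sketch).
    """
    weights = softmax(scores)
    rx, ry = ref_point
    return sum(
        w * bilinear_sample(radar_bev, rx + dx, ry + dy)
        for w, (dx, dy) in zip(weights, offsets)
    )


if __name__ == "__main__":
    radar_bev = [
        [0.0, 0.0, 0.0],
        [0.0, 1.0, 2.0],
        [0.0, 3.0, 4.0],
    ]
    fused = deformable_cross_attention(
        radar_bev,
        ref_point=(1.0, 1.0),
        offsets=[(0.0, 0.0), (1.0, 0.0), (0.5, 0.5)],
        scores=[0.0, 0.0, 0.0],
    )
    print(fused)
```

Because each query samples only a few locations around its reference point rather than attending to every BEV cell, the cost per query depends on the number of sampling points and not on the size of the map. The learned offsets are what would let the camera query compensate for misalignment between the two modalities.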

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
