CVAIMar 2

physfusion: A Transformer-based Dual-Stream Radar and Vision Fusion Framework for Open Water Surface Object Detection

arXiv:2603.01947v1h-index: 3
Originality Incremental advance
AI Analysis

This work addresses the challenge of robust object detection in open water environments for USVs, which is incremental as it builds on existing fusion methods with physics-informed enhancements.

The paper tackled the problem of detecting water-surface targets for Unmanned Surface Vehicles (USVs) by proposing PhysFusion, a physics-informed radar-image fusion framework, which achieved 59.7% mAP50:95 and 90.3% mAP50 on the WaterScenes dataset and 94.8% mAP50 and 46.2% mAP50:95 on FLOW.

Detecting water-surface targets for Unmanned Surface Vehicles (USVs) is challenging due to wave clutter, specular reflections, and weak appearance cues in long-range observations. Although 4D millimeter-wave radar complements cameras under degraded illumination, maritime radar point clouds are sparse and intermittent, with reflectivity attributes exhibiting heavy-tailed variations under scattering and multipath, making conventional fusion designs struggle to exploit radar cues effectively. We propose PhysFusion, a physics-informed radar-image detection framework for water-surface perception. The framework integrates: (1) a Physics-Informed Radar Encoder (PIR Encoder) with an RCS Mapper and Quality Gate, transforming per-point radar attributes into compact scattering priors and predicting point-wise reliability for robust feature learning under clutter; (2) a Radar-guided Interactive Fusion Module (RIFM) performing query-level radar-image fusion between semantically enriched radar features and multi-scale visual features, with the radar branch modeled by a dual-stream backbone including a point-based local stream and a transformer-based global stream using Scattering-Aware Self-Attention (SASA); and (3) a Temporal Query Aggregation module (TQA) aggregating frame-wise fused queries over a short temporal window for temporally consistent representations. Experiments on WaterScenes and FLOW demonstrate that PhysFusion achieves 59.7% mAP50:95 and 90.3% mAP50 on WaterScenes (T=5 radar history) using 5.6M parameters and 12.5G FLOPs, and reaches 94.8% mAP50 and 46.2% mAP50:95 on FLOW under radar+camera setting. Ablation studies quantify the contributions of PIR Encoder, SASA-based global reasoning, and RIFM.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes