HAVT-IVD: Heterogeneity-Aware Cross-Modal Network for Audio-Visual Surveillance: Idling Vehicles Detection With Multichannel Audio and Multiscale Visual Cues
This work addresses a domain-specific surveillance problem, idling vehicle detection, but its contribution is incremental because it builds on existing cross-modal networks with decoupled heads.
The paper tackles idling vehicle detection in surveillance by addressing modality heterogeneity, box scale variation, and training instability, and reports improvements of 7.66 mAP over a disjoint baseline and 9.42 mAP over an end-to-end baseline.
Idling vehicle detection (IVD) uses surveillance video and multichannel audio to localize and classify vehicles in the last frame as moving, idling, or engine-off in pick-up zones. IVD faces three challenges: (i) modality heterogeneity between visual cues and audio patterns; (ii) large box scale variation requiring multi-resolution detection; and (iii) training instability due to coupled detection heads. The previous end-to-end (E2E) model with simple CBAM-based bi-modal attention fails to handle these issues and often misses vehicles. We propose HAVT-IVD, a heterogeneity-aware network with a visual feature pyramid and decoupled heads. Experiments show HAVT-IVD improves mAP by 7.66 over the disjoint baseline and 9.42 over the E2E baseline.
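The reported gains are in mAP, which can be read as the mean of per-class average precision over the three vehicle states. The sketch below shows one common way to compute it, assuming detections have already been matched to ground-truth boxes. The text does not state the matching rule, IoU threshold, or interpolation scheme, so all-point interpolation is an assumption here, and the names `average_precision` and `mean_average_precision` are hypothetical helpers, not the authors' evaluation code.

```python
from typing import Dict, List, Tuple

# The three IVD states named in the abstract.
CLASSES = ("moving", "idling", "engine-off")

# One detection of a given class: (confidence, matched a ground-truth box?).
# How a detection is matched to ground truth is assumed to be decided upstream.
Detection = Tuple[float, bool]


def average_precision(scored_hits: List[Detection], num_gt: int) -> float:
    """Area under the precision-recall curve for one class.

    Uses all-point interpolation (an assumption; the source does not
    specify the interpolation scheme).
    """
    if num_gt == 0:
        return 0.0
    # Rank detections from most to least confident.
    ranked = sorted(scored_hits, key=lambda d: d[0], reverse=True)
    precisions: List[float] = []
    recalls: List[float] = []
    true_positives = 0
    for rank, (_, hit) in enumerate(ranked, start=1):
        true_positives += int(hit)
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_gt)
    # Replace each precision with the maximum precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas between consecutive recall values.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(
    per_class: Dict[str, List[Detection]], gt_counts: Dict[str, int]
) -> float:
    """Mean of per-class AP over CLASSES, scaled to the 0-100 range."""
    aps = [
        average_precision(per_class.get(c, []), gt_counts.get(c, 0))
        for c in CLASSES
    ]
    return 100.0 * sum(aps) / len(aps)
```

Because every class contributes equally to the mean, a model that often misses vehicles of one state lowers the score regardless of how frequent that state is in the data.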