CV · May 19, 2025

VLC Fusion: Vision-Language Conditioned Sensor Fusion for Robust Object Detection

arXiv:2505.12715v1 · 1 citation · h-index: 21
Originality: Highly original
AI Analysis

This work addresses robust object detection for autonomous driving and military applications by adaptively fusing sensor modalities, representing a novel method for a known bottleneck.

The paper tackles robust object detection, where existing sensor fusion methods cannot adaptively weight modalities under varying environmental conditions. It introduces VLC Fusion, which uses a Vision-Language Model to condition fusion on environmental cues such as darkness and rain, and reports improved detection accuracy on autonomous driving and military datasets.

Although fusing multiple sensor modalities can enhance object detection performance, existing fusion approaches often overlook subtle variations in environmental conditions and sensor inputs. As a result, they struggle to adaptively weight each modality under such variations. To address this challenge, we introduce Vision-Language Conditioned Fusion (VLC Fusion), a novel fusion framework that leverages a Vision-Language Model (VLM) to condition the fusion process on nuanced environmental cues. By capturing high-level environmental context such as darkness, rain, and camera blurring, the VLM guides the model to dynamically adjust modality weights based on the current scene. We evaluate VLC Fusion on real-world autonomous driving and military target detection datasets that include image, LIDAR, and mid-wave infrared modalities. Our experiments show that VLC Fusion consistently outperforms conventional fusion baselines, achieving improved detection accuracy in both seen and unseen scenarios.
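
To make the idea of condition-dependent modality weighting concrete, here is a minimal sketch in Python. It is not the paper's architecture: the gating table `GATE`, the cue names, and the functions `modality_weights` and `fuse` are all hypothetical, and the cue scores are assumed to arrive as numbers in [0, 1] (for example, derived from a VLM's answers about the scene). A real system would learn the gating rather than hand-set it, and would operate on feature maps rather than short lists.

```python
import math

MODALITIES = ("image", "lidar", "infrared")
CUES = ("darkness", "rain", "camera_blur")

# Hypothetical hand-set gating table (an assumption for illustration only):
# GATE[m][c] is how much cue c shifts the logit of modality m.
GATE = {
    "image":    {"darkness": -2.0, "rain": -1.0, "camera_blur": -2.0},
    "lidar":    {"darkness":  0.5, "rain": -1.5, "camera_blur":  0.5},
    "infrared": {"darkness":  1.5, "rain":  0.0, "camera_blur":  0.5},
}


def modality_weights(cues):
    """Map cue scores in [0, 1] to softmax weights over the modalities."""
    logits = [
        sum(GATE[m][c] * cues.get(c, 0.0) for c in CUES) for m in MODALITIES
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return {m: e / total for m, e in zip(MODALITIES, exps)}


def fuse(features, cues):
    """Weighted sum of equal-length per-modality feature vectors."""
    weights = modality_weights(cues)
    length = len(features[MODALITIES[0]])
    return [
        sum(weights[m] * features[m][i] for m in MODALITIES)
        for i in range(length)
    ]
```

With no cues active, every logit is zero and the three modalities are weighted equally. With `{"darkness": 1.0}`, the hypothetical table shifts weight away from the image branch and toward infrared, which is the kind of scene-dependent reweighting the abstract describes.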

