RadarVLM: A Vision-Language Model Approach for Radar Scene Understanding
This work addresses the need for reliable radar perception in autonomous driving under adverse conditions. It offers a framework that improves spatial accuracy and representation learning, though its advance over existing vision-language models is incremental and limited to a specific domain.
The paper tackles the fragmented, task-specific state of machine learning for radar scene understanding by introducing RadarVLM, a vision-language framework that learns unified scene-level representations through structured spatial language supervision. Its SG-CLIP objective achieves up to 50% relative F1-score improvement over vanilla CLIP on generative captioning and a 21% AP gain on vehicle segmentation.
Radar sensors provide reliable perception across adverse weather, lighting, and long-range conditions, yet existing machine learning approaches remain fragmented and task-specific, with each downstream task employing distinct architectures and training objectives. We present RadarVLM, a vision-language framework that learns unified scene-level representations through structured spatial language supervision. Leveraging the CARLA simulator with a realistic radar model, we collect over 800k radar-caption pairs across 110+ hours of simulated driving in diverse scenarios. We make two key contributions: (1) a structured caption framework that encodes vehicle distributions in the radar's native coordinate system, and (2) a Spatially-Grounded CLIP (SG-CLIP) objective that replaces binary matching with continuous scene similarity, enabling fine-grained spatial reasoning. We further propose localization-aware evaluation metrics that directly assess spatial accuracy beyond traditional linguistic similarity measures. Validated on generative captioning and vehicle segmentation, SG-CLIP achieves up to 50% relative F1-score improvement over vanilla CLIP and a 21% AP gain on segmentation, demonstrating that language grounding produces spatially structured representations.
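The idea of replacing binary matching with continuous scene similarity can be illustrated with a soft-target contrastive loss. The abstract does not give the exact SG-CLIP formulation, so the following is only a minimal sketch under stated assumptions: the function names, the `scene_sim` matrix (a hypothetical pairwise similarity between the scenes in a batch), and both temperature values are illustrative and not taken from the paper. When no similarity matrix is supplied, the sketch falls back to the one-hot targets of vanilla CLIP.

```python
import numpy as np


def log_softmax(x, axis=1):
    # Numerically stable log-softmax along the given axis.
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def soft_targets(scene_sim, target_temperature=0.1):
    # Assumption: turn a pairwise scene-similarity matrix into a
    # row-stochastic target distribution via a temperature softmax.
    return np.exp(log_softmax(scene_sim / target_temperature, axis=1))


def contrastive_loss(radar_emb, text_emb, scene_sim=None,
                     temperature=0.07, target_temperature=0.1):
    # L2-normalize both embedding sets so logits are scaled cosine similarities.
    r = radar_emb / np.linalg.norm(radar_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = r @ t.T / temperature

    if scene_sim is None:
        # Vanilla CLIP: binary matching, only the paired caption is a positive.
        tgt_r2t = tgt_t2r = np.eye(len(r))
    else:
        # Hypothetical soft variant: similar scenes receive partial credit.
        tgt_r2t = soft_targets(scene_sim, target_temperature)
        tgt_t2r = soft_targets(scene_sim.T, target_temperature)

    # Symmetric cross-entropy over radar-to-text and text-to-radar directions.
    loss_r2t = -(tgt_r2t * log_softmax(logits, axis=1)).sum(axis=1).mean()
    loss_t2r = -(tgt_t2r * log_softmax(logits.T, axis=1)).sum(axis=1).mean()
    return 0.5 * (loss_r2t + loss_t2r)
```

With binary targets, a caption describing a nearly identical vehicle layout is penalized as heavily as one describing an unrelated scene. Under the assumption above, soft targets grade that penalty by scene similarity, which is one plausible way to encourage the fine-grained spatial structure the abstract reports.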