CV | Nov 13, 2025

Robust Object Detection with Pseudo Labels from VLMs using Per-Object Co-teaching

arXiv:2511.09955v1 | h-index: 12
Originality: Incremental advance
AI Analysis

The paper addresses the high cost of manual labeling in autonomous driving by enabling robust training on noisy pseudo-labels. The advance is incremental, since it builds on existing VLM and YOLO methods.

This paper tackles the problem of training efficient object detectors for autonomous driving by using vision-language models (VLMs) to generate pseudo-labels, with a per-object co-teaching strategy to filter noisy boxes. On KITTI it raises mAP@0.5 from 31.12% for a baseline YOLOv5m model to 46.61%, and reaches 57.97% when the pseudo-labelled data is supplemented with 10% ground truth labels.

Foundation models, especially vision-language models (VLMs), offer compelling zero-shot object detection for applications like autonomous driving, a domain where manual labelling is prohibitively expensive. However, their detection latency and tendency to hallucinate predictions render them unsuitable for direct deployment. This work introduces a novel pipeline that addresses this challenge by leveraging VLMs to automatically generate pseudo-labels for training efficient, real-time object detectors. Our key innovation is a per-object co-teaching-based training strategy that mitigates the inherent noise in VLM-generated labels. The proposed per-object co-teaching approach filters noisy bounding boxes from training instead of filtering the entire image. Specifically, two YOLO models learn collaboratively, each filtering out unreliable boxes from every mini-batch based on its peer's per-object loss values. Overall, our pipeline provides an efficient, robust, and scalable approach to train high-performance object detectors for autonomous driving, significantly reducing reliance on costly human annotation. Experimental results on the KITTI dataset demonstrate that our method outperforms a baseline YOLOv5m model, achieving a significant mAP@0.5 boost (31.12% to 46.61%) while maintaining real-time detection latency. Furthermore, we show that supplementing our pseudo-labelled data with a small fraction of ground truth labels (10%) leads to further performance gains, reaching 57.97% mAP@0.5 on the KITTI dataset. We observe similar performance improvements for the ACDC and BDD100k datasets.
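
The per-object filtering step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the small-loss selection rule from the original co-teaching method (each model keeps the lowest-loss fraction of boxes for its peer), and the names `select_small_loss`, `co_teaching_step`, and `forget_rate` are hypothetical. Per-box losses are passed in as plain floats; in a real pipeline they would come from the two YOLO models' unreduced detection losses.

```python
from typing import List, Sequence, Tuple


def select_small_loss(losses: Sequence[float], forget_rate: float) -> List[int]:
    """Return indices of the lowest-loss boxes, keeping a (1 - forget_rate) fraction.

    Assumption: small-loss selection with a fixed forget rate. The abstract
    does not state the exact selection rule or schedule.
    """
    if not 0.0 <= forget_rate < 1.0:
        raise ValueError("forget_rate must be in [0, 1)")
    if not losses:
        return []
    n_keep = max(1, int(len(losses) * (1.0 - forget_rate)))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(ranked[:n_keep])


def mean_over(losses: Sequence[float], keep: Sequence[int]) -> float:
    """Average the losses of the kept boxes (0.0 if nothing is kept)."""
    return sum(losses[i] for i in keep) / len(keep) if keep else 0.0


def co_teaching_step(
    losses_a: Sequence[float],
    losses_b: Sequence[float],
    forget_rate: float,
) -> Tuple[float, float, List[int], List[int]]:
    """One mini-batch of per-object co-teaching over pseudo-labelled boxes.

    losses_a[i] and losses_b[i] are the losses of models A and B on box i.
    Each model is trained only on the boxes its peer considers reliable.
    """
    keep_for_b = select_small_loss(losses_a, forget_rate)  # A filters for B
    keep_for_a = select_small_loss(losses_b, forget_rate)  # B filters for A
    loss_a = mean_over(losses_a, keep_for_a)
    loss_b = mean_over(losses_b, keep_for_b)
    return loss_a, loss_b, keep_for_a, keep_for_b


if __name__ == "__main__":
    # Four pseudo-labelled boxes from one mini-batch; box 1 looks noisy to
    # model A and box 2 looks noisy to model B.
    print(co_teaching_step([0.2, 3.0, 0.5, 0.1], [0.3, 0.4, 2.5, 0.2], 0.25))
```

Because selection happens per box rather than per image, an image with one hallucinated box still contributes its remaining boxes to training, which is the distinction the abstract draws against image-level filtering.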

