CV | Dec 28, 2025

CLIP-Joint-Detect: End-to-End Joint Training of Object Detectors with Contrastive Vision-Language Supervision

Behnam Raoufi, Hossein Sharify, Mohamad Mahdee Ramezanee, Khosrow Hajsadeghi, Saeed Bagheri Shouraki

arXiv:2512.22969v1 | h-index: 22

Originality: Incremental advance

AI Analysis

This work addresses robustness issues in object detection for computer vision applications, though the advance is incremental, since it builds on existing CLIP and detector frameworks.

The paper tackles the problem of class imbalance and label noise in object detectors by integrating CLIP-style contrastive vision-language supervision through end-to-end joint training, achieving consistent and substantial improvements on Pascal VOC and MS COCO benchmarks while preserving real-time inference speed.

Conventional object detectors rely on cross-entropy classification, which can be vulnerable to class imbalance and label noise. We propose CLIP-Joint-Detect, a simple and detector-agnostic framework that integrates CLIP-style contrastive vision-language supervision through end-to-end joint training. A lightweight parallel head projects region or grid features into the CLIP embedding space and aligns them with learnable class-specific text embeddings via InfoNCE contrastive loss and an auxiliary cross-entropy term, while all standard detection losses are optimized simultaneously. The approach applies seamlessly to both two-stage and one-stage architectures. We validate it on Pascal VOC 2007+2012 using Faster R-CNN and on the large-scale MS COCO 2017 benchmark using modern YOLO detectors (YOLOv11), achieving consistent and substantial improvements while preserving real-time inference speed. Extensive experiments and ablations demonstrate that joint optimization with learnable text embeddings markedly enhances closed-set detection performance across diverse architectures and datasets.
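
The contrastive term described in the abstract can be illustrated with a minimal sketch. It assumes each projected region feature is compared by cosine similarity against one learnable text embedding per class, scaled by a temperature, and scored with a softmax over classes. The function names, the temperature of 0.07, and the weight `contrastive_weight` are illustrative assumptions, not values or code taken from the paper, and the auxiliary cross-entropy term is omitted.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def class_contrastive_loss(region_feats, text_embeds, labels, temperature=0.07):
    """InfoNCE-style loss: pull each region embedding toward its class text embedding.

    region_feats: list of projected region/grid feature vectors
    text_embeds:  list of class text embeddings, one per class (learnable in training)
    labels:       ground-truth class index for each region
    temperature:  assumed value, not taken from the paper
    """
    text = [l2_normalize(t) for t in text_embeds]
    total = 0.0
    for feat, label in zip(region_feats, labels):
        f = l2_normalize(feat)
        # Cosine similarity to every class embedding, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(f, t)) / temperature for t in text]
        # Numerically stable log-sum-exp over classes.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_z - logits[label]
    return total / len(labels)


def joint_loss(detection_loss, contrastive_loss, contrastive_weight=1.0):
    """Sum the standard detection loss and the weighted contrastive term.

    contrastive_weight is a hypothetical hyperparameter for this sketch.
    """
    return detection_loss + contrastive_weight * contrastive_loss
```

In this sketch the loss is near zero when a region embedding points in the same direction as its class text embedding and grows as it aligns with a different class, which is the alignment behavior the abstract attributes to the parallel head.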
