CV CL LG | May 27, 2025

Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models

Peter Robicheaux, Matvei Popov, Anish Madan, Isaac Robinson, Joseph Nelson, Deva Ramanan, Neehar Peri

arXiv:2505.20612v4 | 24.822 citations | h-index: 79 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the generalization gap in vision-language models for object detection across diverse domains, providing a benchmark to evaluate and improve model performance in few-shot and other data regimes.

The paper addresses the difficulty vision-language models have in generalizing to out-of-distribution object detection tasks by introducing Roboflow100-VL, a benchmark of 100 multi-modal datasets. It finds that state-of-the-art models achieve less than 2% zero-shot accuracy on challenging medical imaging datasets, highlighting the need for few-shot concept alignment.

Vision-language models (VLMs) trained on internet-scale data achieve remarkable zero-shot detection performance on common objects like car, truck, and pedestrian. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. Rather than simply re-training VLMs on more visual data, we argue that one should align VLMs to new concepts with annotation instructions containing a few visual examples and rich textual descriptions. To this end, we introduce Roboflow100-VL, a large-scale collection of 100 multi-modal object detection datasets with diverse concepts not commonly found in VLM pre-training. We evaluate state-of-the-art models on our benchmark in zero-shot, few-shot, semi-supervised, and fully-supervised settings, allowing for comparison across data regimes. Notably, we find that VLMs like GroundingDINO and Qwen2.5-VL achieve less than 2% zero-shot accuracy on challenging medical imaging datasets within Roboflow100-VL, demonstrating the need for few-shot concept alignment. Lastly, we discuss our recent CVPR 2025 Foundational FSOD competition and share insights from the community. Notably, the winning team significantly outperforms our baseline by 17 mAP! Our code and dataset are available at https://github.com/roboflow/rf100-vl and https://universe.roboflow.com/rf100-vl/.
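The abstract reports results in mAP. To make that unit concrete, here is a minimal sketch of average precision for a single class on a single image at one IoU threshold. It is illustrative only: the box format `(x1, y1, x2, y2)`, the 0.5 threshold, the greedy matching rule, and the helper names `iou` and `average_precision` are all assumptions, and the benchmark's actual evaluation protocol is not described in this summary and may differ (for example, by averaging over several IoU thresholds and over all classes).

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(
    preds: List[Tuple[float, Box]],  # (confidence score, predicted box)
    gts: List[Box],                  # ground-truth boxes for one class
    iou_thr: float = 0.5,            # assumed threshold, not from the paper
) -> float:
    """Simplified AP: area under the precision-recall envelope."""
    if not gts:
        return 0.0
    preds = sorted(preds, key=lambda p: p[0], reverse=True)
    matched = [False] * len(gts)
    tp_flags = []
    for _score, box in preds:
        # Greedily pick the best still-unmatched ground-truth box.
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            if matched[j]:
                continue
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        is_tp = best_j >= 0 and best_iou >= iou_thr
        if is_tp:
            matched[best_j] = True
        tp_flags.append(is_tp)

    # Running precision and recall after each ranked prediction.
    precisions, recalls, tp = [], [], 0
    for k, flag in enumerate(tp_flags, start=1):
        tp += int(flag)
        precisions.append(tp / k)
        recalls.append(tp / len(gts))

    # Make precision monotonically non-increasing (the PR envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum precision over each increase in recall.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

Averaging this quantity over classes (and, in COCO-style evaluation, over IoU thresholds) gives mAP. Expressed as a percentage, the "17 mAP" gap quoted above would correspond to 0.17 on this 0-to-1 scale.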
