CV · Jul 18, 2022

Exploiting Unlabeled Data with Vision and Language Models for Object Detection

DeepMind
arXiv:2207.08954v1 · 0.56 · 131 citations · h-index: 104 · Has Code
AI Analysis · 55

This work addresses the annotation bottleneck in object detection, allowing detectors to scale to more categories with less labeled data. The contribution is incremental, however, because it builds on existing vision-language models.

The paper tackles the high cost of large-scale object detection annotations by using vision and language models to generate pseudo labels from unlabeled images, achieving state-of-the-art results in open-vocabulary detection and improving semi-supervised detection.

Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets. However, it is prohibitively costly to acquire annotations for thousands of categories at a large scale. We propose a novel method that leverages the rich semantics available in recent vision and language models to localize and classify objects in unlabeled images, effectively generating pseudo labels for object detection. Starting with a generic and class-agnostic region proposal mechanism, we use vision and language models to categorize each region of an image into any object category that is required for downstream tasks. We demonstrate the value of the generated pseudo labels in two specific tasks, open-vocabulary detection, where a model needs to generalize to unseen object categories, and semi-supervised object detection, where additional unlabeled images can be used to improve the model. Our empirical evaluation shows the effectiveness of the pseudo labels in both tasks, where we outperform competitive baselines and achieve a novel state-of-the-art for open-vocabulary object detection. Our code is available at https://github.com/xiaofeng94/VL-PLM.
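The pseudo-labeling idea described in the abstract (class-agnostic region proposals, each categorized by a vision and language model) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes region and class-name embeddings have already been computed by some CLIP-style model, and the function name `pseudo_label`, the `temperature` value, and the `min_score` threshold are all hypothetical choices made for the example.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def _normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def pseudo_label(
    boxes: Sequence[Box],
    region_embs: Sequence[Sequence[float]],
    text_embs: Sequence[Sequence[float]],
    class_names: Sequence[str],
    temperature: float = 0.01,  # assumed value, not taken from the paper
    min_score: float = 0.8,     # assumed confidence threshold
) -> List[Tuple[Box, str, float]]:
    """Assign a category to each proposed region; keep only confident ones.

    region_embs[i] is the (precomputed) image embedding of the crop at
    boxes[i]; text_embs[j] is the text embedding of class_names[j].
    """
    text_unit = [_normalize(t) for t in text_embs]
    labels = []
    for box, emb in zip(boxes, region_embs):
        e = _normalize(emb)
        # Cosine similarity to every class name, sharpened by the temperature.
        logits = [sum(a * b for a, b in zip(e, t)) / temperature for t in text_unit]
        # Numerically stable softmax over the class names.
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        total = sum(exps)
        probs = [x / total for x in exps]
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= min_score:
            labels.append((box, class_names[best], probs[best]))
    return labels


# Toy usage with 2-D embeddings: the first region clearly matches "cat",
# the second is ambiguous between the two classes and is discarded.
example = pseudo_label(
    boxes=[(0, 0, 10, 10), (5, 5, 20, 20)],
    region_embs=[[1.0, 0.0], [1.0, 1.0]],
    text_embs=[[1.0, 0.0], [0.0, 1.0]],
    class_names=["cat", "dog"],
)
```

Because the class names are supplied only as text embeddings, the same routine could in principle be pointed at any label space a downstream task requires, which is the property the abstract relies on for open-vocabulary detection.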
