CV AI

Jun 9, 2023

Read, look and detect: Bounding box annotation from image-caption pairs

arXiv:2306.06149v11.52 citations

h-index: 2

Originality: Incremental advance

AI Analysis

The paper addresses the high cost of data annotation in computer vision by enabling object detection under weaker supervision. The advance is incremental because it builds on existing vision-language models.

The paper tackles the problem of reducing annotation costs for object detection by proposing a method that uses image-caption pairs as weak supervision, achieving a 47.51% recall@1 score in phrase grounding on Flickr30k Entities and 21.1 mAP 50 and 10.5 mAP 50:95 on MS COCO.

Various methods have been proposed to detect objects while reducing the cost of data annotation. For instance, weakly supervised object detection (WSOD) methods rely only on image-level annotations during training. Unfortunately, data annotation remains expensive since annotators must provide the categories describing the content of each image and labeling is restricted to a fixed set of categories. In this paper, we propose a method to locate and label objects in an image by using a form of weaker supervision: image-caption pairs. By leveraging recent advances in vision-language (VL) models and self-supervised vision transformers (ViTs), our method is able to perform phrase grounding and object detection in a weakly supervised manner. Our experiments demonstrate the effectiveness of our approach by achieving a 47.51% recall@1 score in phrase grounding on Flickr30k Entities and establishing a new state-of-the-art in object detection by achieving 21.1 mAP 50 and 10.5 mAP 50:95 on MS COCO when exclusively relying on image-caption pairs.
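The recall@1 figure quoted above can be made concrete with a small sketch. This is not the authors' evaluation code. It assumes the common phrase-grounding convention in which a phrase counts as correctly grounded when its single top-ranked predicted box overlaps the ground-truth box with an intersection-over-union (IoU) of at least 0.5. That threshold, the `(x_min, y_min, x_max, y_max)` box format, and the function names `iou` and `recall_at_1` are all illustrative assumptions.

```python
from typing import Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(
    top_predictions: Sequence[Box],
    ground_truths: Sequence[Box],
    iou_threshold: float = 0.5,  # assumed convention, not taken from the paper
) -> float:
    """Fraction of phrases whose top-ranked box matches its ground-truth box."""
    if len(top_predictions) != len(ground_truths):
        raise ValueError("expected one top prediction per ground-truth phrase")
    if not ground_truths:
        return 0.0
    hits = sum(
        iou(pred, gt) >= iou_threshold
        for pred, gt in zip(top_predictions, ground_truths)
    )
    return hits / len(ground_truths)
```

Under these assumptions, a score of 47.51% would mean that for roughly 48 of every 100 annotated phrases the top predicted box meets the overlap threshold.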
