Read, look and detect: Bounding box annotation from image-caption pairs
This work addresses the high cost of data annotation in computer vision by enabling object detection under weaker supervision, though the contribution is incremental because it builds on existing vision-language models.
The paper reduces annotation costs for object detection by using image-caption pairs as weak supervision. It reports a 47.51% recall@1 score in phrase grounding on Flickr30k Entities, and 21.1 mAP@50 and 10.5 mAP@50:95 on MS COCO.
Various methods have been proposed to detect objects while reducing the cost of data annotation. For instance, weakly supervised object detection (WSOD) methods rely only on image-level annotations during training. Unfortunately, data annotation remains expensive, since annotators must provide the categories describing the content of each image, and labeling is restricted to a fixed set of categories. In this paper, we propose a method to locate and label objects in an image by using a form of weaker supervision: image-caption pairs. By leveraging recent advances in vision-language (VL) models and self-supervised vision transformers (ViTs), our method is able to perform phrase grounding and object detection in a weakly supervised manner. Our experiments demonstrate the effectiveness of our approach: it achieves a 47.51% recall@1 score in phrase grounding on Flickr30k Entities and establishes a new state of the art in object detection, with 21.1 mAP@50 and 10.5 mAP@50:95 on MS COCO, when relying exclusively on image-caption pairs.
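To make the phrase grounding metric concrete, the following is a minimal sketch of how a recall@1 score can be computed. It assumes the common convention that the top-ranked predicted box for a phrase counts as correct when its intersection over union (IoU) with the ground-truth box is at least 0.5. The function names `iou` and `recall_at_1`, the `(x_min, y_min, x_max, y_max)` box format, and the one-ground-truth-box-per-phrase simplification are illustrative assumptions, not the evaluation code used in the paper.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(
    top1_predictions: List[Box],
    ground_truths: List[Box],
    threshold: float = 0.5,  # assumed IoU threshold for a correct grounding
) -> float:
    """Fraction of phrases whose top-ranked box matches the ground truth."""
    if len(top1_predictions) != len(ground_truths):
        raise ValueError("expected one top-1 prediction per ground-truth box")
    if not ground_truths:
        return 0.0
    hits = sum(
        iou(pred, gt) >= threshold
        for pred, gt in zip(top1_predictions, ground_truths)
    )
    return hits / len(ground_truths)


# Hypothetical example: two phrases, one grounded correctly.
predictions = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 2.0, 2.0)]
targets = [(0.0, 0.0, 10.0, 8.0), (1.0, 1.0, 3.0, 3.0)]
score = recall_at_1(predictions, targets)
```

Under this convention, a reported score of 47.51% would mean that the top-ranked box overlaps the annotated region sufficiently for roughly 47.51% of the evaluated phrases.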