CV · Apr 7, 2024

Hyperbolic Learning with Synthetic Captions for Open-World Detection

arXiv:2404.05016v1 · 15 citations · h-index: 3 · CVPR
Originality Highly original
AI Analysis

This work addresses the problem of expensive data annotation for open-world detection, offering a more scalable solution for detecting novel objects in images.

The paper tackles open-world detection by generating synthetic captions with vision-language models to avoid costly manual annotation, and introduces a hyperbolic learning method to reduce noise from caption hallucinations. The proposed HyperLearner detector outperforms state-of-the-art methods such as GLIP and Grounding DINO across multiple benchmarks when using the same backbone.

Open-world detection poses significant challenges, as it requires the detection of any object using either object class labels or free-form texts. Existing related works often use large-scale manually annotated caption datasets for training, which are extremely expensive to collect. Instead, we propose to transfer knowledge from vision-language models (VLMs) to enrich the open-vocabulary descriptions automatically. Specifically, we bootstrap dense synthetic captions using pre-trained VLMs to provide rich descriptions of different regions in images, and incorporate these captions to train a novel detector that generalizes to novel concepts. To mitigate the noise caused by hallucination in synthetic captions, we also propose a novel hyperbolic vision-language learning approach to impose a hierarchy between visual and caption embeddings. We call our detector "HyperLearner". We conduct extensive experiments on a wide variety of open-world detection benchmarks (COCO, LVIS, Object Detection in the Wild, RefCOCO) and our results show that our model consistently outperforms existing state-of-the-art methods, such as GLIP, GLIPv2 and Grounding DINO, when using the same backbone.
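To give a feel for why hyperbolic space suits hierarchies, here is a minimal sketch of two standard Poincaré-ball operations. It is not the paper's implementation: the abstract does not say which hyperbolic model, curvature, or loss HyperLearner uses, so the unit-curvature Poincaré ball, the helper names `expmap0` and `poincare_distance`, and the toy vectors are all assumptions made for illustration.

```python
import math


def expmap0(v):
    """Map a Euclidean (tangent) vector at the origin into the open unit ball.

    Assumes the Poincare ball with curvature -1 (an illustrative choice).
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    scale = math.tanh(norm) / norm
    return [scale * x for x in v]


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v))
    # Clamp guards against rounding that pushes arg slightly below 1.
    return math.acosh(max(arg, 1.0))


# Hypothetical toy embeddings: a "general" concept near the origin and a
# "specific" one pointing the same way but farther out.
origin = [0.0, 0.0]
general = expmap0([0.2, 0.1])
specific = expmap0([1.2, 0.6])

depth_general = poincare_distance(origin, general)
depth_specific = poincare_distance(origin, specific)
```

Distance from the origin grows without bound as a point approaches the boundary of the ball, so it can act as a notion of depth in a hierarchy. Which modality (region features or captions) sits closer to the origin in HyperLearner is not stated in the abstract, so the sketch leaves that open.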

