CV | Feb 27, 2023

Aligning Bag of Regions for Open-Vocabulary Object Detection

arXiv:2302.13996v1 | 175 citations | h-index: 128 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the detection of objects from novel categories in the open-vocabulary setting. It is an incremental advance over existing methods, gained by making better use of the compositional structure of a scene.

The paper tackles open-vocabulary object detection by aligning the embeddings of groups of regions (bags) in addition to those of individual regions, leveraging the compositional semantic structure that pre-trained vision-language models may learn implicitly. Applied to Faster R-CNN, it surpasses the previous best results on novel categories by 4.6 box AP50 on open-vocabulary COCO and by 2.8 mask AP on open-vocabulary LVIS.

Pre-trained vision-language models (VLMs) learn to align vision and language representations on large-scale datasets, where each image-text pair usually contains a bag of semantic concepts. However, existing open-vocabulary object detectors only align region embeddings individually with the corresponding features extracted from the VLMs. Such a design leaves the compositional structure of semantic concepts in a scene under-exploited, although the structure may be implicitly learned by the VLMs. In this work, we propose to align the embedding of bag of regions beyond individual regions. The proposed method groups contextually interrelated regions as a bag. The embeddings of regions in a bag are treated as embeddings of words in a sentence, and they are sent to the text encoder of a VLM to obtain the bag-of-regions embedding, which is learned to be aligned to the corresponding features extracted by a frozen VLM. Applied to the commonly used Faster R-CNN, our approach surpasses the previous best results by 4.6 box AP50 and 2.8 mask AP on novel categories of open-vocabulary COCO and LVIS benchmarks, respectively. Code and models are available at https://github.com/wusize/ovdet.
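The bag-level alignment described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the linked repository for that). It assumes a toy stand-in for the VLM text encoder (`toy_text_encoder`, a hypothetical mean-pooling function), random vectors in place of real region and teacher features, and a symmetric contrastive (InfoNCE-style) loss as the alignment objective; the abstract only says the bag embedding "is learned to be aligned" and does not name the loss, so that choice is an assumption.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def toy_text_encoder(pseudo_words):
    """Hypothetical stand-in for a VLM text encoder.

    Takes a sequence of pseudo-word embeddings with shape
    (num_regions, dim) and returns one vector of shape (dim,)
    by mean-pooling. A real text encoder would be a transformer.
    """
    return pseudo_words.mean(axis=0)


def bag_embedding(region_embeddings):
    """Treat the regions of one bag like the words of a sentence and
    encode them into a single unit-length bag embedding."""
    return l2_normalize(toy_text_encoder(region_embeddings))


def alignment_loss(student, teacher, temperature=0.07):
    """Symmetric InfoNCE-style loss between two sets of unit vectors.

    student, teacher: arrays of shape (num_bags, dim); row i of each
    describes the same bag. The loss choice is an assumption made for
    illustration only.
    """
    logits = student @ teacher.T / temperature

    def cross_entropy_on_diagonal(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


# Toy data: 4 bags, each holding 3 region embeddings of dimension 16.
rng = np.random.default_rng(0)
bags = [rng.normal(size=(3, 16)) for _ in range(4)]
student = np.stack([bag_embedding(bag) for bag in bags])

# Random vectors standing in for features from the frozen VLM.
teacher = l2_normalize(rng.normal(size=(4, 16)))

loss = alignment_loss(student, teacher)
print(f"alignment loss on random features: {loss:.4f}")
```

In training, only the detector that produces the region embeddings would receive gradients from such a loss, while the VLM providing the teacher features stays frozen, as stated in the abstract.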

Code Implementations: 1 repo