CV | Mar 25, 2023

Prompt-Guided Transformers for End-to-End Open-Vocabulary Object Detection

arXiv:2303.14386v1 | 24 citations | h-index: 22
Originality: Incremental advance
AI Analysis

This paper addresses the problem of detecting both base and novel object classes efficiently and accurately in computer vision applications. It represents a strong incremental improvement.

The paper tackles open-vocabulary object detection by proposing Prompt-OVD, a framework that uses CLIP class embeddings as prompts to guide a Transformer decoder, achieving 21.2 times faster inference than OV-DETR and higher APs than comparable two-stage methods.

Prompt-OVD is an efficient and effective framework for open-vocabulary object detection that utilizes class embeddings from CLIP as prompts, guiding the Transformer decoder to detect objects in both base and novel classes. Additionally, our novel RoI-based masked attention and RoI pruning techniques help leverage the zero-shot classification ability of the Vision Transformer-based CLIP, resulting in improved detection performance at minimal computational cost. Our experiments on the OV-COCO and OV-LVIS datasets demonstrate that Prompt-OVD achieves 21.2 times faster inference than the first end-to-end open-vocabulary detection method (OV-DETR), while also achieving higher APs than four two-stage methods operating within similar inference time ranges. Code will be made available soon.
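
The abstract does not spell out how RoI-based masked attention is computed, so the following is only a minimal sketch of the general idea, under stated assumptions: a ViT splits the image into square patches, and attention for a given region of interest is restricted to the patches that overlap its box. The function names `roi_patch_mask` and `masked_softmax`, the box format, and the overlap rule are all hypothetical and are not taken from the paper.

```python
import math

def roi_patch_mask(box, image_size, patch_size):
    """Return one boolean per ViT patch (row-major order).

    Assumption: a patch counts as "inside" the RoI if it overlaps the box at all.
    box is (x0, y0, x1, y1) in pixels; image_size is (width, height).
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    cols = width // patch_size
    rows = height // patch_size
    mask = []
    for r in range(rows):
        for c in range(cols):
            px0, py0 = c * patch_size, r * patch_size
            px1, py1 = px0 + patch_size, py0 + patch_size
            overlaps = px0 < x1 and px1 > x0 and py0 < y1 and py1 > y0
            mask.append(overlaps)
    return mask

def masked_softmax(scores, mask):
    """Softmax over attention scores; masked-out patches get weight 0."""
    kept = [s for s, m in zip(scores, mask) if m]
    if not kept:
        raise ValueError("RoI covers no patches")
    top = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]

# A 64x64 image with 16x16 patches gives a 4x4 grid; the box covers the top-left 2x2 patches.
mask = roi_patch_mask((0, 0, 32, 32), (64, 64), 16)
weights = masked_softmax([0.0] * 16, mask)
print(weights[0], weights[2])  # 0.25 0.0
```

In this sketch one shared set of patch features could serve many RoIs, with only the mask changing per box, which is one plausible reading of how such a scheme keeps the added cost small.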

