CV · Mar 9, 2023

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Tsinghua
arXiv:2303.05499v5 · 4194 citations · h-index: 51 · Has Code
Originality: Highly original
AI Analysis

This work addresses open-set object detection: locating arbitrary objects in images, for applications such as robotics and vision systems, beyond the fixed label set that closed-set detectors are limited to.

The paper tackles open-set object detection by integrating language inputs with a Transformer-based detector, enabling detection of arbitrary objects specified by category names or referring expressions. It reports 52.5 AP on the COCO zero-shot transfer benchmark and sets a new record of 26.1 mean AP on the ODinW zero-shot benchmark.

In this paper, we present an open-set object detector, called Grounding DINO, by marrying Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects with human inputs such as category names or referring expressions. The key solution of open-set object detection is introducing language to a closed-set detector for open-set concept generalization. To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for cross-modality fusion. While previous works mainly evaluate open-set object detection on novel categories, we propose to also perform evaluations on referring expression comprehension for objects specified with attributes. Grounding DINO performs remarkably well on all three settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. Grounding DINO achieves a 52.5 AP on the COCO detection zero-shot transfer benchmark, i.e., without any training data from COCO. It sets a new record on the ODinW zero-shot benchmark with a mean 26.1 AP. Code will be available at https://github.com/IDEA-Research/GroundingDINO.
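The language-guided query selection named in the abstract can be illustrated with a small sketch. It assumes that each image token is scored by its maximum dot-product similarity to any text token and that the top-scoring tokens are kept as decoder queries. The abstract does not spell out this scoring rule, so treat it as an assumption. The function names `dot` and `select_queries`, the toy feature values, and the pure-Python vectors are all hypothetical and are not taken from the Grounding DINO code base.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_queries(
    image_feats: List[Vector],
    text_feats: List[Vector],
    num_queries: int,
) -> List[int]:
    """Return indices of the image tokens most similar to any text token.

    Hypothetical helper (assumed scoring rule): each image token is scored
    by its maximum dot-product similarity over all text tokens, and the
    indices of the `num_queries` highest-scoring tokens are returned.
    """
    scores = [max(dot(img, txt) for txt in text_feats) for img in image_feats]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:num_queries]


# Toy example: 4 image tokens and 2 text tokens in a 3-d feature space.
image_feats = [
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.1],
    [0.1, 0.8, 0.0],
    [0.3, 0.3, 0.3],
]
text_feats = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]

selected = select_queries(image_feats, text_feats, num_queries=2)
print(selected)
```

In this toy input, image tokens 0 and 2 align most strongly with the two text tokens, so they are the ones selected. A real detector would run the same idea on batched tensors and use the selected positions to initialize decoder queries.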

Code Implementations: 10 repos

Data from Papers with Code (CC-BY-SA-4.0)

