CV · Jul 18, 2022

Conditional DETR V2: Efficient Detection Transformer with Box Queries

arXiv:2207.08914v1 · 42 citations · h-index: 45
Originality: Incremental advance
AI Analysis

This work provides an incremental improvement for object detection researchers and practitioners by enhancing the speed and memory efficiency of transformer-based detectors.

The paper aims to improve both the efficiency and the detection quality of Detection Transformers (DETR) by reformulating object queries as box queries and learning them from image content. With a DC5-ResNet-50 backbone it reports 44.8 AP at 16.4 FPS on COCO val. Compared to Conditional DETR, that is 1.6x faster, uses 74% less memory, and improves AP by 1.0.
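The relative figures above imply approximate baseline values for Conditional DETR. The short calculation below is only a back-of-the-envelope sketch: the variable names are illustrative, the inputs are the numbers quoted in this summary, and the derived baseline values are inferred here rather than quoted from the paper. Because "1.6x" is a rounded ratio, the implied FPS is approximate.

```python
# Figures quoted in the summary for Conditional DETR V2
# (DC5-ResNet-50 backbone, COCO val).
v2_ap = 44.8          # reported AP
v2_fps = 16.4         # reported frames per second
speedup = 1.6         # "1.6x faster" than Conditional DETR (rounded ratio)
ap_gain = 1.0         # "improves AP by 1.0"
memory_saved = 0.74   # "saves 74%" of overall memory

# Implied baseline values. These are derived, not quoted from the paper.
baseline_ap = v2_ap - ap_gain
baseline_fps = v2_fps / speedup
memory_ratio = 1.0 - memory_saved  # V2 memory as a fraction of the baseline

print(f"implied Conditional DETR AP:    {baseline_ap:.1f}")
print(f"implied Conditional DETR FPS:   {baseline_fps:.2f}")
print(f"V2 memory relative to baseline: {memory_ratio:.2f}x")
```

Read this way, the claim is that V2 needs roughly a quarter of the baseline's memory while running at about 16 FPS instead of about 10.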

In this paper, we are interested in Detection Transformer (DETR), an end-to-end object detection approach based on a transformer encoder-decoder architecture without hand-crafted postprocessing such as NMS. Inspired by Conditional DETR, an improved DETR with fast training convergence that introduced box queries (originally called spatial queries) for internal decoder layers, we reformulate the object query into the format of the box query, which is a composition of the embeddings of the reference point and the transformation of the box with respect to the reference point. This reformulation indicates the connection between the object query in DETR and the anchor box that is widely studied in Faster R-CNN. Furthermore, we learn the box queries from the image content, further improving the detection quality of Conditional DETR while retaining fast training convergence. In addition, we adopt the idea of axial self-attention to reduce the memory cost and accelerate the encoder. The resulting detector, called Conditional DETR V2, achieves better results than Conditional DETR, uses less memory, and runs more efficiently. For example, for the DC5-ResNet-50 backbone, our approach achieves 44.8 AP with 16.4 FPS on the COCO val set; compared to Conditional DETR, it runs 1.6× faster, saves 74% of the overall memory cost, and improves the AP score by 1.0.
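To give a feel for why axial self-attention reduces encoder memory, the sketch below counts attention scores for a single head on an H×W feature map. It assumes the generic axial formulation, in which each position attends along its row and then along its column. The paper's exact encoder design is not described in this abstract, so the function names and the feature-map size are hypothetical, and the ratio printed here should not be read as the source of the reported 74% overall saving.

```python
def full_attention_scores(height, width):
    """Scores when every position attends to every other position."""
    n = height * width
    return n * n


def axial_attention_scores(height, width):
    """Scores when each position attends along its row, then its column."""
    n = height * width
    row_scores = n * width    # each position scores the `width` cells in its row
    col_scores = n * height   # each position scores the `height` cells in its column
    return row_scores + col_scores


# Hypothetical feature-map size, chosen only for illustration.
h, w = 50, 67
full = full_attention_scores(h, w)
axial = axial_attention_scores(h, w)
print(f"full attention scores:  {full}")
print(f"axial attention scores: {axial}")
print(f"axial / full:           {axial / full:.4f}")
```

The full attention map grows as (HW)^2, while the axial variant grows as HW(H + W), so the gap widens as the feature map gets larger, which is the regime a dilated (DC5) backbone operates in.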

