CV · Aug 2, 2023

Revisiting DETR Pre-training for Object Detection

Yan Ma, Weicong Liang, Bohan Chen, Yiduo Hao, Bojian Hou, Xiangyu Yue, Chao Zhang, Yuhui Yuan

Berkeley

arXiv:2308.01300v2 · 6.88 citations · h-index: 25

Originality: Incremental advance

AI Analysis

This work aims to improve object detection accuracy, a goal relevant to computer vision researchers, but the advance is incremental because it builds on existing DETR frameworks and pre-training methods.

The paper examines self-supervised pre-training for DETR-based object detection models. It finds that existing methods such as DETReg fail to improve strong models such as $\mathcal{H}$-Deformable-DETR under full data conditions, and it proposes an optimized approach called Simple Self-training that reaches 59.3% AP on the COCO val set, 1.4% above $\mathcal{H}$-Deformable-DETR + Swin-L without pre-training.

Motivated by the remarkable achievements of DETR-based approaches on COCO object detection and segmentation benchmarks, recent endeavors have been directed towards elevating their performance through self-supervised pre-training of Transformers while preserving a frozen backbone. Noteworthy advancements in accuracy have been documented in certain studies. Our investigation delved deeply into a representative approach, DETReg, and its performance assessment in the context of emerging models like $\mathcal{H}$-Deformable-DETR. Regrettably, DETReg proves inadequate in enhancing the performance of robust DETR-based models under full data conditions. To dissect the underlying causes, we conduct extensive experiments on COCO and PASCAL VOC probing elements such as the selection of pre-training datasets and strategies for pre-training target generation. By contrast, we employ an optimized approach named Simple Self-training which leads to marked enhancements through the combination of an improved box predictor and the Objects$365$ benchmark. The culmination of these endeavors results in a remarkable AP score of $59.3\%$ on the COCO val set, outperforming $\mathcal{H}$-Deformable-DETR + Swin-L without pre-training by $1.4\%$. Moreover, a series of synthetic pre-training datasets, generated by merging contemporary image-to-text (LLaVA) and text-to-image (SDXL) models, significantly amplifies object detection capabilities.
