QKVA grid: Attention in Image Perspective and Stacked DETR
This work addresses efficiency and accuracy issues in object detection models, particularly for small objects, but is incremental as it builds directly on DETR.
The paper tackles the high training cost and performance limitations of DETR for object detection by introducing a stacked architecture and a QKVA grid perspective on attention. The resulting model improves on DETR (+0.6 AP overall, +2.7 AP on small objects) and outperforms the optimized Faster R-CNN baseline on small objects.
We present a new model named Stacked-DETR (SDETR), which inherits the main ideas of the canonical DETR. We improve DETR in two directions: reducing the cost of training and introducing a stacked architecture to enhance performance. For the former, we focus on the inside of the attention block and propose the QKVA grid, a new perspective for describing the attention process. With it, we can look more closely at how attention works for image problems and at the effect of multiple heads. These two ideas contribute to the design of a single-head encoder layer. For the latter, SDETR reaches better performance than DETR (+0.6 AP, +2.7 AP_S). In particular, on small objects, where DETR fell short, SDETR achieves better results than the optimized Faster R-CNN baseline. Our changes are based on the DETR code. Training code and pretrained models are available at https://github.com/shengwenyuan/sdetr.
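For reference, the single-head attention that the abstract refers to can be sketched in its standard scaled dot-product form. This is a minimal illustration, not the paper's QKVA grid formulation, which the abstract does not define. Reading A as the attention-weight matrix is an assumption, and the function name, the toy 2x2 feature map, and the use of identity projections (Q = K = V = tokens) are all hypothetical simplifications.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_attention(Q, K, V):
    """Standard scaled dot-product attention with one head.

    Q, K: lists of n vectors of length d; V: list of n vectors of length d_v.
    Returns (A, out), where A is the n x n attention-weight matrix
    (assumed here to be the "A" in QKVA) and out is the n x d_v result A @ V.
    """
    d = len(Q[0])
    scale = 1.0 / math.sqrt(d)
    # Each row of A holds one query's weights over all keys.
    A = [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K])
        for q in Q
    ]
    # Each output vector is a weighted average of the value vectors.
    out = [
        [sum(a * v[j] for a, v in zip(row, V)) for j in range(len(V[0]))]
        for row in A
    ]
    return A, out


# Hypothetical 2x2 feature map with 2 channels, flattened to 4 tokens.
# Learned projections are omitted: Q = K = V = tokens (an assumption).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
A, out = single_head_attention(tokens, tokens, tokens)
```

For an image, each token is one spatial position of the flattened feature map, so row i of A describes how position i distributes its attention over every other position.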