Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement
This work addresses the efficiency and accuracy of DETR-style object detectors, refining existing two-stage DETR frameworks rather than replacing them.
The paper tackles the computational burden and scale bias of two-stage DETR methods by proposing hierarchical salience filtering refinement. It reports gains of up to +4.4% AP on three task-specific detection datasets and 49.2% AP on COCO 2017 with fewer FLOPs.
DETR-like methods have significantly increased detection performance in an end-to-end manner. Mainstream two-stage frameworks among them perform dense self-attention and select a fraction of queries for sparse cross-attention, which has proven effective for improving performance but also introduces a heavy computational burden and a high dependence on stable query selection. This paper demonstrates that suboptimal two-stage selection strategies result in scale bias and redundancy, due to the mismatch between selected queries and objects in two-stage initialization. To address these issues, we propose hierarchical salience filtering refinement, which performs transformer encoding only on filtered discriminative queries, for a better trade-off between computational efficiency and precision. The filtering process overcomes scale bias through a novel scale-independent salience supervision. To compensate for the semantic misalignment among queries, we introduce elaborate query refinement modules for stable two-stage initialization. Based on the above improvements, the proposed Salience DETR achieves significant improvements of +4.0% AP, +0.2% AP, and +4.4% AP on three challenging task-specific detection datasets, as well as 49.2% AP on COCO 2017 with fewer FLOPs. The code is available at https://github.com/xiuqhou/Salience-DETR.
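As a rough illustration of the filtering idea described above (encoding only the most salient queries at each feature level), the following is a minimal sketch, not the authors' implementation. The function name `filter_salient_queries`, the level names, the salience scores, and the keep ratios are all hypothetical values chosen for the example. In the actual method the salience scores would be predicted by the network, and the released code at the repository above should be consulted for the real procedure.

```python
from typing import Dict, List


def filter_salient_queries(
    salience_per_level: Dict[str, List[float]],
    keep_ratio_per_level: Dict[str, float],
) -> Dict[str, List[int]]:
    """Return, for each feature level, the indices of the tokens to keep.

    Hypothetical helper: keeps the top fraction of tokens by salience score,
    with a separate keep ratio per level.
    """
    kept: Dict[str, List[int]] = {}
    for level, scores in salience_per_level.items():
        ratio = keep_ratio_per_level[level]
        # Keep at least one token so no level is dropped entirely.
        k = max(1, int(len(scores) * ratio))
        # Rank token indices by descending salience score.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        # Restore positional order of the survivors.
        kept[level] = sorted(ranked[:k])
    return kept


# Toy salience scores for two feature levels (assumed values).
salience = {
    "level_0": [0.9, 0.1, 0.4, 0.8, 0.2, 0.05, 0.7, 0.3],
    "level_1": [0.6, 0.2, 0.95, 0.1],
}
# Assumed keep ratios; the values used in the paper may differ.
keep_ratios = {"level_0": 0.5, "level_1": 0.5}

kept = filter_salient_queries(salience, keep_ratios)
print(kept)
```

Only the tokens at the returned indices would be passed on to transformer encoding in such a scheme, which is where the reduction in computation would come from; the remaining tokens are left unencoded in this sketch.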