CV AI LG MM

Aug 24, 2022

Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors

Gongjie Zhang, Zhipeng Luo, Zichen Tian, Jingyi Zhang, Xiaoqin Zhang, Shijian Lu

arXiv:2208.11356v2

11.247 citations

h-index: 67

Originality: Highly original

AI Analysis

This work addresses efficiency issues in object detection for computer vision applications, offering an incremental improvement over existing Transformer-based methods.

The paper tackles the high computational cost of using multi-scale features in Transformer-based object detectors. It proposes Iterative Multi-scale Feature Aggregation (IMFA), which sparsely samples multi-scale features from a few key locations and yields significant performance gains with only slight overhead, for example improving DETR by 2.1% AP on COCO.

Multi-scale features have been proven highly effective for object detection but often come with huge and even prohibitive extra computation costs, especially for the recent Transformer-based detectors. In this paper, we propose Iterative Multi-scale Feature Aggregation (IMFA) -- a generic paradigm that enables efficient use of multi-scale features in Transformer-based object detectors. The core idea is to exploit sparse multi-scale features from just a few crucial locations, and it is achieved with two novel designs. First, IMFA rearranges the Transformer encoder-decoder pipeline so that the encoded features can be iteratively updated based on the detection predictions. Second, IMFA sparsely samples scale-adaptive features for refined detection from just a few keypoint locations under the guidance of prior detection predictions. As a result, the sampled multi-scale features are sparse yet still highly beneficial for object detection. Extensive experiments show that the proposed IMFA boosts the performance of multiple Transformer-based object detectors significantly yet with only slight computational overhead.
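The sparse sampling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `bilinear_sample`, `sample_sparse_multiscale`, and `ramp` are hypothetical, the keypoints are passed in by hand (in IMFA they are chosen under the guidance of prior detection predictions), and the scale-adaptive weighting and iterative encoder update are omitted. The sketch only shows why sampling a few locations per scale produces far fewer tokens than flattening every feature map.

```python
from typing import List, Sequence, Tuple

# A feature map indexed as [row][col][channel]; a plain-list stand-in for a tensor.
FeatureMap = List[List[List[float]]]


def bilinear_sample(fmap: FeatureMap, x: float, y: float) -> List[float]:
    """Bilinearly interpolate fmap at normalized coordinates (x, y) in [0, 1]."""
    h, w = len(fmap), len(fmap[0])
    # Clamp, then map normalized coordinates onto the grid (corner-aligned).
    px = min(max(x, 0.0), 1.0) * (w - 1)
    py = min(max(y, 0.0), 1.0) * (h - 1)
    x0, y0 = int(px), int(py)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = px - x0, py - y0
    channels = len(fmap[0][0])
    return [
        (1 - dx) * (1 - dy) * fmap[y0][x0][c]
        + dx * (1 - dy) * fmap[y0][x1][c]
        + (1 - dx) * dy * fmap[y1][x0][c]
        + dx * dy * fmap[y1][x1][c]
        for c in range(channels)
    ]


def sample_sparse_multiscale(
    pyramid: Sequence[FeatureMap],
    keypoints: Sequence[Tuple[float, float]],
) -> List[List[float]]:
    """Return one feature vector per (keypoint, scale) pair."""
    tokens = []
    for x, y in keypoints:
        for fmap in pyramid:
            tokens.append(bilinear_sample(fmap, x, y))
    return tokens


def ramp(h: int, w: int) -> FeatureMap:
    """Toy single-channel map whose value at (row, col) is row * w + col."""
    return [[[float(r * w + c)] for c in range(w)] for r in range(h)]


# Two scales of a toy pyramid and three hand-picked keypoints (assumptions).
pyramid = [ramp(4, 4), ramp(2, 2)]
keypoints = [(0.0, 0.0), (1.0, 1.0), (0.5, 0.5)]
tokens = sample_sparse_multiscale(pyramid, keypoints)

dense_token_count = sum(len(f) * len(f[0]) for f in pyramid)
print(len(tokens), "sparse tokens vs", dense_token_count, "dense tokens")
```

In this toy setup the sparse path yields one token per keypoint per scale, so its size depends on the number of keypoints rather than on the resolution of the feature maps. A real detector would operate on batched tensors and learn where to sample.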

