CV · AI · Jun 23, 2022

YOLOSA: Object detection based on 2D local feature superimposed self-attention

arXiv:2206.11825v2 · 10 citations · h-index: 40
Originality: Incremental advance
AI Analysis

This work addresses the need for more accurate and efficient object detection models, particularly for real-time applications, though it appears incremental as it builds upon existing YOLO architectures.

The paper tackles the problem of improving both detection accuracy and inference efficiency in real-time object detection. It proposes a novel self-attention module for the feature concatenation stage and reports state-of-the-art results, with average precisions of 49.0%, 46.1%, and 39.1% for its large, medium, and small-scale models, respectively.

We analyzed the network structure of real-time object detection models and found that the features in the feature concatenation stage are very rich, so applying an attention module there can effectively improve the model's detection accuracy. However, commonly used attention and self-attention modules perform poorly in both detection accuracy and inference efficiency. We therefore propose a novel self-attention module, called 2D local feature superimposed self-attention, for the feature concatenation stage of the neck network. This self-attention module reflects global features through local features and local receptive fields. We also propose and optimize an efficient decoupled head and AB-OTA, and achieve SOTA results. Large, medium, and small-scale models built with our proposed improvements obtained average precisions of 49.0% (71 FPS, 14 ms), 46.1% (85 FPS, 11.7 ms), and 39.1% (107 FPS, 9.3 ms), respectively. Our models exceed YOLOv5 by 0.8% to 3.1% in average precision.
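The abstract does not spell out how the 2D local feature superimposed self-attention module is built, so the following is only a point of reference: a generic single-head local-window self-attention in NumPy, in which each position of an (H, W, C) feature map attends only to its window × window neighbourhood. The function name `local_window_self_attention`, the projection matrices `w_q`, `w_k`, `w_v`, and the border handling (windows clipped at the edges) are assumptions for illustration, not the paper's design; in particular, the "superimposed" combination of local features the authors describe is not modeled here.

```python
import numpy as np


def local_window_self_attention(x, w_q, w_k, w_v, window=3):
    """Generic single-head local-window self-attention (hypothetical sketch,
    NOT the YOLOSA module).

    x: feature map of shape (H, W, C)
    w_q, w_k, w_v: projection matrices of shape (C, D)
    window: odd side length of the square neighbourhood each pixel attends to
    returns: array of shape (H, W, D)
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    h, w, _ = x.shape
    d = w_q.shape[1]
    # Project every pixel's C-dim feature to queries, keys, values: (H, W, D)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    r = window // 2
    out = np.zeros((h, w, d))
    for i in range(h):
        for j in range(w):
            # Neighbourhood bounds, clipped at the borders instead of padding
            i0, i1 = max(0, i - r), min(h, i + r + 1)
            j0, j1 = max(0, j - r), min(w, j + r + 1)
            keys = k[i0:i1, j0:j1].reshape(-1, d)
            vals = v[i0:i1, j0:j1].reshape(-1, d)
            # Scaled dot-product scores between this query and its local keys
            scores = keys @ q[i, j] / np.sqrt(d)
            scores -= scores.max()  # subtract max for numerical stability
            weights = np.exp(scores)
            weights /= weights.sum()  # softmax over the local window
            out[i, j] = weights @ vals  # weighted sum of local values
    return out


# Usage on a random 8x8 feature map with 16 channels, projected to 8 dims
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))
w_q, w_k, w_v = (rng.standard_normal((16, 8)) for _ in range(3))
y = local_window_self_attention(x, w_q, w_k, w_v, window=3)
print(y.shape)  # (8, 8, 8)
```

Restricting each query to a fixed window keeps the cost proportional to H·W·window² rather than (H·W)² for full global self-attention, which is the general efficiency argument for local attention on dense feature maps; how the paper recovers global context from these local features is described only qualitatively in the abstract.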
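The paired FPS and latency figures in the abstract can be sanity-checked against each other. The short script below assumes that FPS is simply 1000 divided by the per-image latency in milliseconds (batch size 1, no pipelining); the abstract does not state the batch size or hardware, so this is an assumption, and the helper name `fps_from_latency` is hypothetical.

```python
# Reported (model size, AP %, FPS, latency in ms) from the abstract
reported = [
    ("large", 49.0, 71, 14.0),
    ("medium", 46.1, 85, 11.7),
    ("small", 39.1, 107, 9.3),
]


def fps_from_latency(latency_ms):
    """Throughput implied by a per-image latency, assuming one image at a time."""
    return 1000.0 / latency_ms


for name, ap, fps, latency_ms in reported:
    implied = fps_from_latency(latency_ms)
    print(f"{name}: reported {fps} FPS, implied {implied:.1f} FPS")
# large: reported 71 FPS, implied 71.4 FPS
# medium: reported 85 FPS, implied 85.5 FPS
# small: reported 107 FPS, implied 107.5 FPS
```

Under that assumption the reported FPS values match the truncated reciprocals of the latencies, so the two sets of numbers are consistent with each other.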

Code Implementations: 2 repos
