CV, Jul 12, 2021

Locally Enhanced Self-Attention: Combining Self-Attention and Convolution as Local and Context Terms

arXiv:2107.05637v3, 1 citation
Originality: Incremental advance
AI Analysis

This work targets attention mechanisms in vision models. It is an incremental advance: the local part of self-attention is strengthened with convolutions to improve performance on downstream tasks.

The authors decompose self-attention into local and context terms and propose Locally Enhanced Self-Attention (LESA), which enhances the local term with convolutions and couples it to the context term through a fusion module. On ImageNet and COCO, LESA outperforms convolution and self-attention baselines on image recognition, object detection, and instance segmentation.

Self-Attention has become prevalent in computer vision models. Inspired by fully connected Conditional Random Fields (CRFs), we decompose self-attention into local and context terms. They correspond to the unary and binary terms in CRF and are implemented by attention mechanisms with projection matrices. We observe that the unary terms only make small contributions to the outputs, and meanwhile standard CNNs that rely solely on the unary terms achieve great performances on a variety of tasks. Therefore, we propose Locally Enhanced Self-Attention (LESA), which enhances the unary term by incorporating it with convolutions, and utilizes a fusion module to dynamically couple the unary and binary operations. In our experiments, we replace the self-attention modules with LESA. The results on ImageNet and COCO show the superiority of LESA over convolution and self-attention baselines for the tasks of image recognition, object detection, and instance segmentation. The code is made publicly available.
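The decomposition described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes a single attention head over a 1-D sequence, takes the local (unary) term to be each position's attention to itself and the context (binary) term to be its attention to all other positions, and uses a simple sigmoid gate as a stand-in for the fusion module. The names `decompose_self_attention`, `conv1d_same`, `lesa_like`, and the parameters `kernel` and `w_f` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def decompose_self_attention(x, w_q, w_k, w_v):
    """Split single-head self-attention into local and context terms.

    x: (n, d) sequence of n positions; w_q, w_k, w_v: (d, d) projections.
    Returns (local, context, full) with local + context == full.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n, n) attention weights
    self_weight = np.diag(attn)                     # weight of position i on itself
    local = self_weight[:, None] * v                # unary term (j == i)
    context = (attn - np.diag(self_weight)) @ v     # binary term (j != i)
    return local, context, attn @ v


def conv1d_same(x, kernel):
    """Zero-padded sliding-window filter along the sequence axis.

    Assumption: one odd-length kernel shared by all channels, applied as a
    cross-correlation (the usual deep-learning "convolution").
    """
    pad = len(kernel) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return sum(kernel[t] * xp[t:t + x.shape[0]] for t in range(len(kernel)))


def lesa_like(x, w_q, w_k, w_v, kernel, w_f):
    """Illustrative fusion: convolution-enhanced local term plus gated context.

    Assumption: the gate is a sigmoid over the concatenated terms, with
    w_f of shape (2d, d). The paper's fusion module may differ.
    """
    _, context, _ = decompose_self_attention(x, w_q, w_k, w_v)
    local = conv1d_same(x @ w_v, kernel)            # local term from a convolution
    fused_in = np.concatenate([local, context], axis=-1)
    gate = 1.0 / (1.0 + np.exp(-(fused_in @ w_f)))  # per-position, per-channel weight
    return local + gate * context
```

In this sketch the attention-based local term is replaced outright by a convolution over the value projections, and the gate decides per position how much context to mix in. Inspecting `np.diag(attn)` on real inputs is one way to check the abstract's observation that the unary term contributes little.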

Code Implementations: 3 repos
