Lightweight Structure-Aware Attention for Visual Understanding
This work addresses efficiency and performance limitations of attention in visual models and is relevant to researchers and practitioners, though it is incremental in that it builds on existing attention mechanisms.
The paper tackles two limitations of attention operators in visual understanding, high redundancy and quadratic complexity, by proposing Lightweight Structure-aware Attention (LiSA), which achieves state-of-the-art results on ImageNet-1K and on downstream benchmarks such as Kinetics-400 and COCO.
The attention operator has been widely used as a basic building block in visual understanding because its adjustable kernels provide flexibility. However, this operator suffers from two inherent limitations: (1) the attention kernel is not discriminative enough, resulting in high redundancy, and (2) its computation and memory complexity is quadratic in the sequence length. In this paper, we propose a novel attention operator, called Lightweight Structure-aware Attention (LiSA), which has better representation power with log-linear complexity. Our operator makes the attention kernels more discriminative by learning structural patterns. These structural patterns are encoded by using a set of relative position embeddings (RPEs) as multiplicative weights, thereby improving the representation power of the attention kernels. Additionally, the RPEs are approximated to obtain log-linear complexity. Our experiments and analyses demonstrate that the proposed operator outperforms self-attention and other existing operators, achieving state-of-the-art results on ImageNet-1K and on downstream tasks such as video action recognition on Kinetics-400, object detection and instance segmentation on COCO, and semantic segmentation on ADE-20K.
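The gap between quadratic and log-linear cost can be illustrated with a small sketch. It is not the LiSA operator itself: the abstract does not specify how the RPEs are approximated, so the example below assumes a simplified 1D setting in which a kernel depends only on the relative offset i - j (a Toeplitz matrix) and applies it to the values with an FFT, a standard way to reach O(n log n) for such kernels. The function names `toeplitz_apply_dense` and `toeplitz_apply_fft`, and the layout of the weight vector `w`, are hypothetical choices made for illustration.

```python
import numpy as np


def toeplitz_apply_dense(w, v):
    """O(n^2) reference: y[i] = sum_j w[i - j] * v[j].

    w has length 2n - 1 and stores one weight per relative offset,
    with w[d + n - 1] holding the weight for offset d = i - j.
    v has shape (n, d_model).
    """
    n = v.shape[0]
    idx = np.arange(n)
    offsets = idx[:, None] - idx[None, :]   # (n, n) matrix of i - j
    kernel = w[offsets + n - 1]             # explicit Toeplitz kernel
    return kernel @ v


def toeplitz_apply_fft(w, v):
    """O(n log n) version of the same product via FFT convolution."""
    n = v.shape[0]
    size = 3 * n - 2                        # length of the full linear convolution
    fw = np.fft.rfft(w, n=size)
    fv = np.fft.rfft(v, n=size, axis=0)
    full = np.fft.irfft(fw[:, None] * fv, n=size, axis=0)
    # Output row i of the Toeplitz product sits at convolution index i + n - 1.
    return full[n - 1 : 2 * n - 1]


# Illustrative check on random data (assumed sizes, not from the paper).
rng = np.random.default_rng(0)
n, d_model = 16, 4
w = rng.standard_normal(2 * n - 1)          # one weight per relative offset
v = rng.standard_normal((n, d_model))       # value vectors
dense_out = toeplitz_apply_dense(w, v)
fft_out = toeplitz_apply_fft(w, v)
```

Both functions compute the same result; the dense version materializes an n-by-n kernel, while the FFT version never does, which is where the memory and compute savings come from in this simplified setting. How closely this matches the approximation used in the paper is an assumption.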