Armour: Generalizable Compact Self-Attention for Vision Transformers
This work addresses the efficiency and performance gap faced by vision transformer users, but the contribution is incremental, as it builds on existing attention optimizations.
The paper tackles the problem of compact vision transformers falling short of convnets in accuracy, model size, and throughput. It introduces a generalizable compact self-attention mechanism that reduces redundancy and improves efficiency. The result is smaller and faster models with the same or better accuracy.
Attention-based transformer networks have demonstrated promising potential as their applications extend from natural language processing to vision. However, despite recent improvements such as sub-quadratic attention approximation and various training enhancements, compact vision transformers that use regular attention still fall short of their convnet counterparts in terms of \textit{accuracy}, \textit{model size}, and \textit{throughput}. This paper introduces a compact self-attention mechanism that is fundamental and highly generalizable. The proposed method reduces redundancy and improves efficiency on top of existing attention optimizations. We show its drop-in applicability to both the regular attention mechanism and some of the most recent variants in vision transformers. As a result, we produce smaller and faster models with the same or better accuracy.