CV · Aug 28, 2022

ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers

arXiv:2208.13138v1 · 9 citations · h-index: 80
Originality: Highly original
AI Analysis

This paper addresses the efficiency challenges of vision Transformers in dense prediction tasks, offering a novel method that reduces computational overhead while maintaining performance.

The paper tackles the quadratic computational complexity of vision Transformers by proposing ClusTR, a content-based sparse attention method that clusters key and value tokens to reduce token count while retaining long-range dependencies. It achieves state-of-the-art performance with lower cost, e.g., a small model with 22.7M parameters attains 83.2% Top-1 accuracy on ImageNet.

Although Transformers have successfully transitioned from their language modelling origins to image-based applications, their quadratic computational complexity remains a challenge, particularly for dense prediction. In this paper we propose a content-based sparse attention method, as an alternative to dense self-attention, that aims to reduce computational complexity while retaining the ability to model long-range dependencies. Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count. The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost. We further extend the clustering-guided attention from single-scale to multi-scale, which is conducive to dense prediction tasks. We label the proposed Transformer architecture ClusTR, and demonstrate that it achieves state-of-the-art performance on various vision tasks at lower computational cost and with fewer parameters. For instance, our ClusTR small model with 22.7M parameters achieves 83.2% Top-1 accuracy on ImageNet. Source code and ImageNet models will be made publicly available.
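
The idea of clustering and aggregating key and value tokens can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say which clustering algorithm ClusTR uses, so plain k-means on the key tokens stands in here as an assumption, and the function names (cluster_tokens, clustered_attention) are hypothetical. The sketch is single-head and single-scale, with no learned projections. With N query tokens and M clusters, the attention step costs on the order of N·M·d operations instead of N²·d (the cost of the clustering itself is not counted).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; -inf scores become zero weights.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cluster_tokens(keys, values, num_clusters, num_iters=10, seed=0):
    """Group key tokens with k-means (an assumption, see lead-in) and
    average the keys and values inside each cluster."""
    rng = np.random.default_rng(seed)
    n = keys.shape[0]
    # Initialise centroids from randomly chosen, distinct key tokens.
    centroids = keys[rng.choice(n, size=num_clusters, replace=False)].copy()

    def assign_to_centroids():
        # Squared Euclidean distance from every key to every centroid: (n, M).
        dists = ((keys[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        return dists.argmin(axis=1)

    for _ in range(num_iters):
        assign = assign_to_centroids()
        for c in range(num_clusters):
            members = keys[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)

    assign = assign_to_centroids()
    one_hot = np.eye(num_clusters)[assign]            # (n, M) membership matrix
    counts = one_hot.sum(axis=0)                      # tokens per cluster
    safe = np.maximum(counts, 1.0)[:, None]           # avoid division by zero
    clustered_keys = one_hot.T @ keys / safe          # (M, d)
    clustered_values = one_hot.T @ values / safe      # (M, d_v)
    return clustered_keys, clustered_values, counts

def clustered_attention(queries, keys, values, num_clusters):
    """Every query attends to M aggregated tokens instead of all N tokens."""
    ck, cv, counts = cluster_tokens(keys, values, num_clusters)
    d = queries.shape[-1]
    scores = queries @ ck.T / np.sqrt(d)              # (N, M) instead of (N, N)
    scores[:, counts == 0] = -np.inf                  # ignore empty clusters
    return softmax(scores, axis=-1) @ cv              # (N, d_v)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    q = rng.normal(size=(196, 64))   # e.g. a 14x14 grid of patch tokens
    k = rng.normal(size=(196, 64))
    v = rng.normal(size=(196, 64))
    out = clustered_attention(q, k, v, num_clusters=16)
    print(out.shape)
```

Only the keys and values are clustered; the queries keep their full resolution, so the output still has one vector per input token, which is what dense prediction needs. When the number of clusters equals the number of (distinct) tokens, each token forms its own cluster and the sketch reduces to ordinary dense attention.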

