CV | Jun 30, 2025

Low-latency vision transformers via large-scale multi-head attention

Ronit D. Gross, Tal Halevi, Ella Koresh, Yarden Tzach, Ido Kanter

arXiv:2506.23832v111.85 citations | h-index: 6 | Physica A: Statistical Mechanics and its Applications

Originality: Incremental advance

AI Analysis

This work addresses latency issues in vision transformers for computer vision applications, but it is incremental as it builds on known mechanisms and focuses on specific architectures.

The paper tackled the problem of improving vision transformer efficiency by analyzing multi-head attention mechanisms, achieving a significant reduction in latency without affecting accuracy on CIFAR-100.

The emergence of spontaneous symmetry breaking among a few heads of multi-head attention (MHA) across transformer blocks in classification tasks was recently demonstrated through the quantification of single-nodal performance (SNP). This finding indicates that each head focuses its attention on a subset of labels through cooperation among its SNPs. This underlying learning mechanism is generalized to large-scale MHA (LS-MHA) using a single matrix value representing single-head performance (SHP), analogous to single-filter performance in convolutional neural networks (CNNs). The results indicate that each SHP matrix comprises multiple unit clusters such that each label is explicitly recognized by a few heads with negligible noise. This leads to an increased signal-to-noise ratio (SNR) along the transformer blocks, thereby improving classification accuracy. These features give rise to several distinct vision transformer (ViT) architectures that achieve the same accuracy but differ in their LS-MHA structures. As a result, their soft committee yields superior accuracy, an outcome not typically observed in CNNs, which rely on hundreds of filters. In addition, a significant reduction in latency is achieved without affecting the accuracy by replacing the initial transformer blocks with convolutional layers. This substitution accelerates early-stage learning, which is then improved by subsequent transformer layers. The extension of this learning mechanism to natural language processing tasks, based on quantitative differences between CNNs and ViT architectures, has the potential to yield new insights in deep learning. The findings are demonstrated using compact convolutional transformer architectures trained on the CIFAR-100 dataset.
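
The abstract does not define how the soft committee combines the ViT architectures. A common reading, assumed here, is that each member's softmax output is averaged and the label with the highest mean probability is chosen. The sketch below illustrates only that assumption on made-up logits; the function names and the toy numbers are hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def soft_committee_predict(member_logits):
    """Average per-member softmax outputs and pick the top label.

    member_logits has shape (members, samples, labels).
    Assumption: "soft committee" means averaging probabilities,
    as opposed to a hard majority vote over predicted labels.
    """
    probs = softmax(np.asarray(member_logits, dtype=float))
    return probs.mean(axis=0).argmax(axis=-1)


# Toy logits (hypothetical): 3 members, 2 samples, 3 labels.
member_logits = [
    [[2.0, 1.9, 0.0], [0.0, 3.0, 0.1]],  # member 1 narrowly prefers label 0 on sample 0
    [[0.0, 2.5, 0.0], [0.2, 2.0, 0.0]],  # member 2 strongly prefers label 1
    [[1.0, 1.2, 0.0], [0.0, 1.0, 0.5]],  # member 3 weakly prefers label 1
]
committee_labels = soft_committee_predict(member_logits)
print(committee_labels)
```

Averaging probabilities lets a confident member outweigh a member that is nearly undecided, which a hard majority vote over labels would not capture.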
