LG, CV | May 27, 2025

Leaner Transformers: More Heads, Less Depth

Hemanth Saratchandran, Damien Teney, Simon Lucey

arXiv:2505.20802v1 | 14.47 citations | h-index: 26

Originality: Incremental advance

AI Analysis

This work addresses the problem of unnecessarily large transformer models, offering researchers and practitioners a way to reduce model size and computational cost without losing accuracy, though its reinterpretation of the role of multi-head attention is an incremental advance.

The paper challenges the 'bigger is better' trend in transformers by showing that increasing the number of attention heads improves the conditioning of the attention block, which allows depth to be reduced and the parameter count cut by up to 30-50% while maintaining accuracy on tasks such as ImageNet-1k and GLUE.

Transformers have reshaped machine learning by utilizing attention mechanisms to capture complex patterns in large datasets, leading to significant improvements in performance. This success has contributed to the belief that "bigger means better", leading to ever-increasing model sizes. This paper challenges this ideology by showing that many existing transformers might be unnecessarily oversized. We discover a theoretical principle that redefines the role of multi-head attention. An important benefit of the multiple heads is in improving the conditioning of the attention block. We exploit this theoretical insight and redesign popular architectures with an increased number of heads. The improvement in the conditioning proves so significant in practice that model depth can be decreased, reducing the parameter count by up to 30-50% while maintaining accuracy. We obtain consistent benefits across a variety of transformer-based architectures of various scales, on tasks in computer vision (ImageNet-1k) as well as language and sequence modeling (GLUE benchmark, TinyStories, and the Long-Range Arena benchmark).
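The parameter arithmetic behind "more heads, less depth" can be illustrated with a rough sketch. It assumes a standard transformer block in which each head has width d_model / n_heads, so the head count does not change the parameter count while depth scales it linearly. The function names and the specific configurations (12 layers with 12 heads versus 7 layers with 16 heads at d_model = 768) are hypothetical choices for illustration and are not taken from the paper, whose redesigned architectures may differ.

```python
def block_params(d_model: int, n_heads: int, mlp_ratio: int = 4) -> int:
    """Count weights and biases in one standard transformer block.

    Assumption (not from the paper): each head has width d_model // n_heads.
    """
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    d_head = d_model // n_heads
    # Q, K, V projections: n_heads heads, each mapping d_model -> d_head.
    qkv = 3 * (d_model * d_head * n_heads + d_head * n_heads)
    # Output projection from the concatenated heads back to d_model.
    out = d_model * d_model + d_model
    # Two-layer MLP with hidden width mlp_ratio * d_model.
    hidden = mlp_ratio * d_model
    mlp = (d_model * hidden + hidden) + (hidden * d_model + d_model)
    # Two LayerNorms, each with a scale and a shift vector.
    norms = 2 * 2 * d_model
    return qkv + out + mlp + norms


def stack_params(depth: int, d_model: int, n_heads: int) -> int:
    """Parameters in a stack of identical blocks (embeddings and head excluded)."""
    return depth * block_params(d_model, n_heads)


# Hypothetical baseline and leaner variant, chosen only for illustration.
baseline = stack_params(depth=12, d_model=768, n_heads=12)
leaner = stack_params(depth=7, d_model=768, n_heads=16)
reduction = 1 - leaner / baseline

print(f"baseline blocks: {baseline:,} parameters")
print(f"leaner blocks:   {leaner:,} parameters")
print(f"reduction:       {reduction:.1%}")
```

Under these assumptions, adding heads is free in parameter terms, so any savings come entirely from the removed layers. Whether accuracy holds at the reduced depth is the empirical claim of the paper and is not something this arithmetic can show.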
