LG · CV · May 27, 2025

Leaner Transformers: More Heads, Less Depth

arXiv:2505.20802v1 · 7 citations · h-index: 26
Originality: Incremental advance
AI Analysis

This work targets researchers and practitioners working with unnecessarily large transformer models. It offers a way to cut computational cost without losing performance, though its contribution, a reinterpretation of the role of attention heads, is incremental.

The paper challenges the 'bigger is better' trend in transformers by showing that increasing the number of attention heads improves conditioning, allowing depth reduction and 30-50% parameter cuts while maintaining accuracy across tasks like ImageNet-1k and GLUE.

Transformers have reshaped machine learning by utilizing attention mechanisms to capture complex patterns in large datasets, leading to significant improvements in performance. This success has contributed to the belief that "bigger means better", leading to ever-increasing model sizes. This paper challenges this ideology by showing that many existing transformers might be unnecessarily oversized. We discover a theoretical principle that redefines the role of multi-head attention. An important benefit of the multiple heads is in improving the conditioning of the attention block. We exploit this theoretical insight and redesign popular architectures with an increased number of heads. The improvement in the conditioning proves so significant in practice that model depth can be decreased, reducing the parameter count by up to 30-50% while maintaining accuracy. We obtain consistent benefits across a variety of transformer-based architectures of various scales, on tasks in computer vision (ImageNet-1k) as well as language and sequence modeling (GLUE benchmark, TinyStories, and the Long-Range Arena benchmark).
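To see why trading depth for heads can shrink a model, it may help to count parameters directly. The sketch below assumes a standard transformer encoder block (multi-head attention with head dimension d_model / num_heads, an MLP with expansion ratio 4, two LayerNorms) and ignores embeddings and the classification head. The function names and the two configurations (12 layers with 12 heads versus 7 layers with 16 heads at width 768) are hypothetical illustrations, not the settings reported in the paper.

```python
def block_params(d_model: int, num_heads: int, mlp_ratio: int = 4) -> int:
    """Parameter count of one standard transformer encoder block (assumed layout)."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    head_dim = d_model // num_heads

    # Q, K, V projections: each head maps d_model -> head_dim, plus biases.
    qkv = 3 * num_heads * (d_model * head_dim + head_dim)
    # Output projection back to d_model, plus bias.
    out_proj = d_model * d_model + d_model
    # Two-layer MLP with hidden width mlp_ratio * d_model, plus biases.
    hidden = mlp_ratio * d_model
    mlp = (d_model * hidden + hidden) + (hidden * d_model + d_model)
    # Two LayerNorms, each with a scale and a shift vector.
    norms = 2 * 2 * d_model
    return qkv + out_proj + mlp + norms


def encoder_params(depth: int, d_model: int, num_heads: int) -> int:
    """Parameter count of a stack of identical blocks."""
    return depth * block_params(d_model, num_heads)


# Hypothetical configurations, chosen only to illustrate the arithmetic.
baseline = encoder_params(depth=12, d_model=768, num_heads=12)
leaner = encoder_params(depth=7, d_model=768, num_heads=16)
reduction = 1 - leaner / baseline

print(f"baseline blocks: {baseline:,} parameters")
print(f"leaner blocks:   {leaner:,} parameters")
print(f"reduction:       {reduction:.1%}")
```

Under these assumptions the head count does not change the size of a block, because the heads split a fixed width among themselves. All of the savings therefore come from removing layers, and dropping 5 of 12 layers lands inside the 30-50% range quoted in the abstract. Whether accuracy holds at the reduced depth is the paper's empirical claim and is not something this arithmetic can show.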

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
