CL | AI | LG | Apr 12, 2024

When Attention Collapses: How Degenerate Layers in LLMs Enable Smaller, Stronger Models

arXiv:2404.08634v3 | 8 citations | h-index: 71 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses computational inefficiency in LLMs for researchers and practitioners and offers a practical method for model compression. The advance is incremental because it builds on existing transformer architectures.

The paper tackles structural inefficiency in large language models (LLMs) by identifying degenerate attention layers whose attention matrices collapse to near rank-one patterns. It proposes Inheritune, a training recipe that initializes a compact model from the useful early layers of a larger pre-trained model, yielding models with significantly fewer layers that match or outperform their larger counterparts.
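To make "collapse to near rank-one" concrete, here is a minimal NumPy sketch of two diagnostics one could compute on a single attention matrix. The function names (`causal_softmax`, `rank_one_energy`, `max_column_mass`) and both metrics are illustrative assumptions, not the measurements the paper necessarily uses.

```python
import numpy as np

def causal_softmax(scores):
    """Row-wise softmax under a causal mask (token i attends to tokens 0..i)."""
    n = scores.shape[0]
    allowed = np.tril(np.ones((n, n), dtype=bool))
    masked = np.where(allowed, scores, -np.inf)
    masked = masked - masked.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=1, keepdims=True)

def rank_one_energy(attn):
    """Share of squared singular-value mass held by the top singular value.

    Illustrative metric (an assumption): values near 1.0 mean the matrix
    is close to rank one.
    """
    s = np.linalg.svd(attn, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))

def max_column_mass(attn):
    """Largest share of total attention weight landing on one key position."""
    return float(attn.sum(axis=0).max() / attn.sum())

n = 8
# Collapsed pattern: every query puts almost all of its weight on token 0.
collapsed_scores = np.zeros((n, n))
collapsed_scores[:, 0] = 50.0
collapsed = causal_softmax(collapsed_scores)

# Contrast pattern: each query spreads weight uniformly over its visible prefix.
uniform = causal_softmax(np.zeros((n, n)))

print(rank_one_energy(collapsed), max_column_mass(collapsed))
print(rank_one_energy(uniform), max_column_mass(uniform))
```

The collapsed matrix scores close to 1.0 on both diagnostics, while the uniform causal pattern scores visibly lower. Any threshold for calling a layer degenerate would be a separate choice that this sketch does not make.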

Large Language Models (LLMs) rely on the transformer architecture and its self-attention mechanism to deliver strong performance across tasks. However, we uncover a structural inefficiency in standard pre-trained decoder-style LLMs: in many of the deeper layers, attention matrices frequently collapse to near rank-one, single-column patterns. We refer to these underutilized components as lazy layers, which are redundant and computationally inefficient. To address this, we propose Inheritune, a simple and effective training recipe for building smaller, more efficient, and high-performing language models. Inheritune initializes a compact model by inheriting the useful early layers from a larger pre-trained model, then progressively retrains and expands it. Our experiments across multiple models and datasets show that Inheritune-trained models, despite having significantly fewer layers, can match or even outperform their larger counterparts. This approach yields compact, performant models and offers a practical path for efficient language model compression. Code is available at https://github.com/sanyalsunny111/LLM-Inheritune
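The inherit, retrain, and expand loop described in the abstract can be sketched in plain Python. This is a toy under stated assumptions, not the authors' implementation: `ToyModel`, `inherit`, `inheritune_sketch`, `train_fn`, `eval_fn`, `step`, and `target_loss` are all hypothetical names. The stopping rule (a target validation loss) and the expansion rule (appending copies of the next parent blocks) are also assumptions, since the abstract does not specify them.

```python
import copy
from dataclasses import dataclass, field

@dataclass
class ToyModel:
    """Stand-in for a decoder-only transformer: embeddings plus a stack of blocks."""
    embeddings: list
    layers: list = field(default_factory=list)

def inherit(parent, num_layers):
    """Copy the parent's embeddings and its first `num_layers` blocks into a new model."""
    if not 1 <= num_layers <= len(parent.layers):
        raise ValueError("num_layers must be between 1 and the parent's depth")
    return ToyModel(
        embeddings=copy.deepcopy(parent.embeddings),
        layers=copy.deepcopy(parent.layers[:num_layers]),
    )

def inheritune_sketch(parent, start_layers, step, target_loss, train_fn, eval_fn):
    """Retrain and grow a child model until it reaches target_loss or full depth.

    train_fn and eval_fn are caller-supplied placeholders (assumptions).
    """
    child = inherit(parent, start_layers)
    while True:
        train_fn(child)
        reached_target = eval_fn(child) <= target_loss
        at_full_depth = len(child.layers) == len(parent.layers)
        if reached_target or at_full_depth:
            return child
        # Assumed expansion rule: append copies of the next `step` parent blocks.
        lo = len(child.layers)
        hi = min(lo + step, len(parent.layers))
        child.layers.extend(copy.deepcopy(parent.layers[lo:hi]))

# Toy usage: a 12-block parent, a no-op trainer, and a fake loss that
# simply decreases as the child gets deeper.
parent = ToyModel(embeddings=[0.1, 0.2], layers=[[float(i)] for i in range(12)])
child = inheritune_sketch(
    parent,
    start_layers=2,
    step=2,
    target_loss=6,
    train_fn=lambda model: None,
    eval_fn=lambda model: 12 - len(model.layers),
)
print(len(child.layers))
```

In a real setting `train_fn` would run gradient updates and `eval_fn` would return a validation loss. The toy callables here only exercise the control flow.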

Code Implementations: 1 repo