CL | LG | Oct 26, 2024

Dynamic layer selection in decoder-only transformers

arXiv:2410.20022v1 | 3 citations | h-index: 21
Originality: Incremental advance
AI Analysis

This work addresses inference efficiency for large language models. It is an incremental advance because it builds on existing dynamic inference approaches.

The paper evaluates two dynamic inference methods, layer skipping and early exiting, as ways to reduce inference cost in large language models. It finds that layer skipping is the more robust of the two, and that an oracle controller can match the full model's performance while using only 23.3% of the layers on average.

The vast size of Large Language Models (LLMs) has prompted a search to optimize inference. One effective approach is dynamic inference, which adapts the architecture to the sample-at-hand to reduce the overall computational cost. We empirically examine two common dynamic inference methods for natural language generation (NLG): layer skipping and early exiting. We find that a pre-trained decoder-only model is significantly more robust to layer removal via layer skipping, as opposed to early exit. We demonstrate the difficulty of using hidden state information to adapt computation on a per-token basis for layer skipping. Finally, we show that dynamic computation allocation on a per-sequence basis holds promise for significant efficiency gains by constructing an oracle controller. Remarkably, we find that there exists an allocation which achieves equal performance to the full model using only 23.3% of its layers on average.
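The difference between the two methods in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the hidden state is a single float, each "layer" is a hypothetical function added to a residual stream, and the brute-force `oracle_allocation` is an assumed stand-in that treats "equal performance" as reproducing the full model's output on one input.

```python
from itertools import combinations
from typing import Callable, List, Sequence

Layer = Callable[[float], float]

# Hypothetical toy "layers": each returns an update to the residual stream.
LAYERS: List[Layer] = [
    lambda h: 1.0,  # contributes to the output
    lambda h: 0.0,  # redundant for this input
    lambda h: 0.0,  # redundant for this input
    lambda h: 2.0,  # contributes to the output
]


def early_exit(layers: Sequence[Layer], x: float, exit_at: int) -> float:
    """Run the first `exit_at` layers, then stop; later layers never run."""
    for layer in layers[:exit_at]:
        x = x + layer(x)  # residual update
    return x


def layer_skip(layers: Sequence[Layer], x: float, keep: Sequence[bool]) -> float:
    """Run only the layers flagged in `keep`; skipped layers pass x through."""
    for layer, use in zip(layers, keep):
        if use:
            x = x + layer(x)
    return x


def oracle_allocation(
    layers: Sequence[Layer], x: float, target: float, tol: float = 1e-9
) -> List[bool]:
    """Brute-force search for the smallest layer subset matching `target`.

    Exponential in the number of layers, so only usable on toy examples.
    """
    n = len(layers)
    for size in range(n + 1):
        for kept in combinations(range(n), size):
            mask = [i in kept for i in range(n)]
            if abs(layer_skip(layers, x, mask) - target) <= tol:
                return mask
    return [True] * n


full_output = layer_skip(LAYERS, 0.0, [True] * len(LAYERS))
mask = oracle_allocation(LAYERS, 0.0, full_output)
fraction_used = sum(mask) / len(mask)
```

In this toy setup, exiting after two layers discards the contribution of the last layer, while skipping the two redundant middle layers reproduces the full output with half the layers. A real controller would have to predict such a mask without access to the full model's output, which is the difficulty the abstract describes for per-token decisions.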

Code Implementations: 1 repo
