CL | AI | Jan 4, 2025

AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference

arXiv:2501.02336v1 | 13 citations | h-index: 25 | AAAI
Originality: Incremental advance
AI Analysis

This work addresses computational efficiency for users of long-context LLMs, representing an incremental improvement over prior layer-wise skipping methods.

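The paper proposes AdaSkip, an adaptive sublayer skipping method for accelerating long-context LLM inference. It targets three limitations of existing layer-wise skipping strategies: they do not adapt to model and context variability, they ignore differences in sublayer significance, and they do not apply to the prefilling phase. The authors report better inference performance than existing baselines across several long-context benchmarks and models.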

Long-context large language model (LLM) inference is increasingly critical, motivating a number of studies devoted to alleviating the substantial storage and computational costs in such scenarios. Layer-wise skipping methods are promising optimizations but have rarely been explored in long-context inference. We observe that existing layer-wise skipping strategies have several limitations when applied to long-context inference, including the inability to adapt to model and context variability, disregard for sublayer significance, and inapplicability to the prefilling phase. This paper proposes AdaSkip, an adaptive sublayer skipping method specifically designed for long-context inference. AdaSkip adaptively identifies less important layers by leveraging on-the-fly similarity information, enables sublayer-wise skipping, and accelerates both the prefilling and decoding phases. The effectiveness of AdaSkip is demonstrated through extensive experiments on various long-context benchmarks and models, showcasing its superior inference performance over existing baselines.
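The similarity-based selection mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a sublayer whose output is nearly parallel to its input (high cosine similarity) contributes little and is a candidate for skipping. The function names, the fixed skip budget, and the toy vectors are hypothetical and are not taken from the paper or its code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_sublayers_to_skip(io_pairs, budget):
    """Pick the `budget` sublayers whose output is most similar to their input.

    io_pairs maps a sublayer name to an (input_vector, output_vector) pair.
    Assumption: high input/output similarity means low importance.
    Returns (names_to_skip, similarity_scores).
    """
    scores = {name: cosine_similarity(x, y) for name, (x, y) in io_pairs.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:budget], scores


# Toy hidden states for attention and feed-forward sublayers (illustrative values only).
toy_pairs = {
    "layer0.attn": ([1.0, 0.0, 0.0], [0.2, 0.9, 0.1]),
    "layer0.ffn": ([1.0, 1.0, 0.0], [1.0, 1.1, 0.0]),
    "layer1.attn": ([0.0, 1.0, 1.0], [0.0, 1.0, 0.9]),
    "layer1.ffn": ([1.0, 0.0, 1.0], [-0.5, 1.0, 0.2]),
}

skipped, scores = select_sublayers_to_skip(toy_pairs, budget=2)
for name in toy_pairs:
    status = "skip" if name in skipped else "keep"
    print(f"{name}: similarity={scores[name]:.3f} -> {status}")
```

Scoring attention and feed-forward sublayers separately, rather than whole layers, is what lets a selection like this keep one half of a layer while skipping the other. How AdaSkip actually measures similarity and chooses a budget is not stated in the abstract.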

Code Implementations: 1 repo
