CL · Mar 11, 2025

Position-Aware Depth Decay Decoding ($D^3$): Boosting Large Language Model Inference Efficiency

arXiv:2503.08524v1 · 2 citations · h-index: 8 · ACL
Originality: Incremental advance
AI Analysis

The work addresses inference-efficiency concerns for practitioners deploying LLMs, though the advance is incremental because it builds on existing dynamic computation methods.

The paper tackles the problem of resource-intensive inference in Large Language Models (LLMs) by proposing a training-free layer skipping framework, achieving an average 1.5x speedup with less than 1% performance drop on benchmarks like GSM8K and BBH.

Due to the large number of parameters, the inference phase of Large Language Models (LLMs) is resource-intensive. Unlike traditional model compression, which requires retraining, recent dynamic computation methods show that not all components are needed at inference time, enabling a training-free pipeline. In this paper, we focus on the dynamic depth of LLM generation. We propose a token-position-aware layer skipping framework that reduces operations by 1.5x while maintaining performance. We first observe that tokens predicted later have lower perplexity and thus require less computation. We then propose a training-free algorithm called Position-Aware Depth Decay Decoding ($D^3$), which uses a power-law decay function, $\left\lfloor L \times \alpha^{i} \right\rfloor$, to determine the number of layers to retain when generating token $T_i$. Remarkably, without any retraining, $D^3$ is the first such method to succeed across a wide range of generation tasks. Experiments on large language models (i.e., Llama) with $7 \sim 70$ billion parameters show that $D^3$ achieves an average 1.5x speedup over the full-inference pipeline with a performance drop of less than 1% ($<1\%$) on the GSM8K and BBH benchmarks.
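The decay rule in the abstract, $\left\lfloor L \times \alpha^{i} \right\rfloor$, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the zero-based token index, the `min_layers` floor, and the example values of $L$ and $\alpha$ are all assumptions made for illustration, and the sketch only computes how many layers are retained, not which layers are skipped.

```python
import math


def retained_layers(num_layers: int, alpha: float, position: int,
                    min_layers: int = 1) -> int:
    """Number of layers kept for the token at `position`.

    Implements floor(L * alpha**i) from the abstract. The clamp to
    `min_layers` is an assumption added here so the count never
    reaches zero for long generations.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    if position < 0:
        raise ValueError("position must be non-negative")
    return max(min_layers, math.floor(num_layers * alpha ** position))


def depth_schedule(num_layers: int, alpha: float, num_tokens: int,
                   min_layers: int = 1) -> list[int]:
    """Layer budget for each generated token, indexed from 0 (an assumption)."""
    return [retained_layers(num_layers, alpha, i, min_layers)
            for i in range(num_tokens)]


def average_layer_fraction(schedule: list[int], num_layers: int) -> float:
    """Fraction of full-depth layer evaluations that the schedule actually uses."""
    return sum(schedule) / (len(schedule) * num_layers)


if __name__ == "__main__":
    # Hypothetical values: a 32-layer model and alpha = 0.99.
    schedule = depth_schedule(num_layers=32, alpha=0.99, num_tokens=64)
    print(schedule[:8])
    print(round(average_layer_fraction(schedule, 32), 3))
```

With $\alpha = 1$ the schedule keeps every layer for every token, which recovers full inference; smaller values of $\alpha$ shrink the layer budget faster as generation proceeds.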

