CL, LG | May 23, 2025

DASH: Input-Aware Dynamic Layer Skipping for Efficient LLM Inference with Markov Decision Policies

arXiv:2505.17420v1 | 1 citation | h-index: 14
Originality: Incremental advance
AI Analysis

This paper addresses the inference-efficiency problem of deploying LLMs in latency-sensitive scenarios. It represents an incremental improvement over existing methods.

The paper tackles the high inference cost of large language models (LLMs) by proposing DASH, a dynamic layer-skipping framework that uses Markov Decision Policies to adapt computation based on input, achieving significant inference acceleration while maintaining competitive task performance.

Large language models (LLMs) have achieved remarkable performance across a wide range of NLP tasks. However, their substantial inference cost poses a major barrier to real-world deployment, especially in latency-sensitive scenarios. To address this challenge, we propose DASH, an adaptive layer-skipping framework that dynamically selects computation paths conditioned on input characteristics. We model the skipping process as a Markov Decision Process (MDP), enabling fine-grained token-level decisions based on intermediate representations. To mitigate potential performance degradation caused by skipping, we introduce a lightweight compensation mechanism that injects differential rewards into the decision process. Furthermore, we design an asynchronous execution strategy that overlaps layer computation with policy evaluation to minimize runtime overhead. Experiments on multiple LLM architectures and NLP benchmarks show that our method achieves significant inference acceleration while maintaining competitive task performance, outperforming existing methods.
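To make the idea of input-dependent layer skipping concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation. The names `make_layer`, `skip_policy`, `compensate`, and `forward`, the norm-based skip rule, and the scalar rescaling used as "compensation" are all illustrative assumptions, and the asynchronous overlap of policy evaluation with layer computation is omitted. The sketch only shows a per-layer decision that depends on the current state (hidden vector, layer index, previous action), which is what makes the process Markov.

```python
import math
import random

def make_layer(dim, rng):
    """Toy residual 'layer' h + tanh(W h); a stand-in for a transformer block."""
    w = [[rng.gauss(0.0, 0.3) for _ in range(dim)] for _ in range(dim)]
    def layer(h):
        return [h[i] + math.tanh(sum(w[i][j] * h[j] for j in range(dim)))
                for i in range(dim)]
    return layer

def skip_policy(h, layer_idx, prev_skipped, threshold):
    """Hypothetical policy. The decision uses only the current state
    (hidden vector, layer index, previous action), so it is Markov."""
    if layer_idx == 0 or prev_skipped:
        return False  # always run the first layer; never skip twice in a row
    norm = math.sqrt(sum(x * x for x in h))
    return norm > threshold  # assumption: skip when the state norm is large

def compensate(h, scale=1.05):
    """Assumed cheap substitute for a skipped layer: a fixed rescaling."""
    return [scale * x for x in h]

def forward(h, layers, threshold):
    """Run the stack, skipping layers chosen by the policy.
    Returns the final hidden vector and the number of layers executed."""
    executed = 0
    prev_skipped = False
    for idx, layer in enumerate(layers):
        if skip_policy(h, idx, prev_skipped, threshold):
            h = compensate(h)
            prev_skipped = True
        else:
            h = layer(h)
            executed += 1
            prev_skipped = False
    return h, executed
```

Calling `forward(h0, layers, float("inf"))` disables skipping, so every layer runs. Lowering `threshold` trades executed layers for the cheap compensation step. In a real system the hand-written rule would be replaced by a learned policy, and the decision could be made per token rather than per input vector.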
