LG AI | Aug 8, 2025

DP-LLM: Runtime Model Adaptation with Dynamic Layer-wise Precision Assignment

Sangwoo Kwon, Seong Hoon Seo, Jae W. Lee, Yeonhong Park

arXiv:2508.06041v39.42 citations | h-index: 9

Originality: Incremental advance

AI Analysis

This work addresses the challenge of handling queries efficiently on on-device LLMs. The advance is incremental because it builds on existing mixed-precision quantization.

The paper tackles the problem of adapting on-device large language models (LLMs) to varying runtime constraints such as latency and accuracy. It introduces DP-LLM, a mechanism that dynamically assigns a precision to each layer based on input values. The authors report a better performance-latency trade-off than prior approaches.

How can we effectively handle queries for on-device large language models (LLMs) with varying runtime constraints, such as latency and accuracy? Multi-scale quantization addresses this challenge by enabling memory-efficient runtime model adaptation of LLMs through the overlaying of multiple model variants quantized to different bitwidths. Meanwhile, an important question still remains open-ended: how can models be properly configured to match a target precision or latency? While mixed-precision offers a promising solution, we take this further by leveraging the key observation that the sensitivity of each layer dynamically changes across decoding steps. Building on this insight, we introduce DP-LLM, a novel mechanism that dynamically assigns precision to each layer based on input values. Experimental results across multiple models and benchmarks demonstrate that DP-LLM achieves a superior performance-latency trade-off, outperforming prior approaches.
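To make the idea of input-dependent, per-layer precision concrete, here is a minimal toy sketch. It is not the DP-LLM algorithm: the abstract does not describe the selection rule, so the norm-threshold selector below is a hypothetical stand-in. The overlay of bitwidths by truncating 8-bit codes to their top 4 bits is likewise an assumed simplification of multi-scale quantization. All function names and parameters are illustrative.

```python
import math
from typing import List, Tuple

HIGH_BITS = 8  # assumed: bitwidth of the stored codes
LOW_BITS = 4   # assumed: lower bitwidth derived from the same codes


def quantize_high(weights: List[List[float]], scale: float) -> List[List[int]]:
    """Symmetric uniform quantization of weights to signed HIGH_BITS integers."""
    qmax = 2 ** (HIGH_BITS - 1) - 1
    qmin = -qmax - 1
    return [[max(qmin, min(qmax, round(w / scale))) for w in row] for row in weights]


def dequantize(codes: List[List[int]], scale: float, bits: int) -> List[List[float]]:
    """Read the stored codes at `bits` precision by dropping the low-order bits.

    One set of codes serves both bitwidths, which is the memory-saving
    'overlay' idea in simplified form.
    """
    shift = HIGH_BITS - bits
    step = scale * (1 << shift)
    return [[(c >> shift) * step for c in row] for row in codes]


def select_bits(x: List[float], threshold: float) -> int:
    """Hypothetical selector: use high precision when the input norm is large.

    This rule is an assumption for illustration, not the paper's criterion.
    """
    norm = math.sqrt(sum(v * v for v in x))
    return HIGH_BITS if norm > threshold else LOW_BITS


def layer_forward(
    codes: List[List[int]], scale: float, x: List[float], threshold: float
) -> Tuple[List[float], int]:
    """Matrix-vector product with a precision chosen from the current input."""
    bits = select_bits(x, threshold)
    w = dequantize(codes, scale, bits)
    y = [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
    return y, bits


weights = [[0.5, -0.25], [0.125, 1.0]]
scale = 1 / 64
codes = quantize_high(weights, scale)

# The same layer runs at different precisions for different inputs.
y_hi, bits_hi = layer_forward(codes, scale, [3.0, 4.0], threshold=1.0)
y_lo, bits_lo = layer_forward(codes, scale, [0.1, 0.1], threshold=1.0)
```

In a real decoder the selector would run at every decoding step for every layer, so it would need to be much cheaper than the layer it gates. How the selection is made, and how it is tuned to hit a target average precision or latency, is exactly what the paper addresses and is not captured by this sketch.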
