LG | May 21, 2025

Harnessing On-Device Large Language Model: Empirical Results and Implications for AI PC

Qingyu Song, Peiyu Liao, Wenqian Zhao, Yiwen Wang, Shoubo Hu, Hui-Ling Zhen, Ning Jiang, Mingxuan Yuan

arXiv:2505.15030v34.12 citations | h-index: 6 | Has Code

Originality: Incremental advance

AI Analysis

This work provides practical guidelines for efficiently deploying LLMs on resource-constrained devices, addressing privacy and performance issues for edge computing applications.

The paper tackles the challenge of deploying large language models on edge devices by introducing a systematic evaluation methodology and running experiments on models of up to 14B parameters with various quantization methods. It finds that larger models with low-bit quantization outperform smaller ones at higher precision, and that quantization offers significant memory savings with minimal accuracy loss.

The increasing deployment of Large Language Models (LLMs) on edge devices, driven by model advancements and hardware improvements, offers significant privacy benefits. However, these on-device LLMs inherently face performance limitations due to reduced model capacity and necessary compression techniques. To address this, we introduce a systematic methodology -- encompassing model capability, development efficiency, and system resources -- for evaluating on-device LLMs. Our comprehensive evaluation, encompassing models from 0.5B to 14B parameters and seven post-training quantization (PTQ) methods on commodity laptops, yields several critical insights: 1) System-level metrics exhibit near-linear scaling with effective bits-per-weight (BPW). 2) A practical threshold exists around $\sim$3.5 effective BPW; larger models subjected to low-bit quantization consistently outperform smaller models utilizing higher bit-precision. 3) Quantization with low BPW incurs marginal accuracy loss but significant memory savings. 4) Power consumption on CPU is determined by low-level implementation specifics, with computation-intensive operations consuming more power than memory-intensive ones. These findings offer crucial insights and practical guidelines for the efficient deployment and optimized configuration of LLMs on resource-constrained edge devices. Our codebase is available at https://github.com/simmonssong/LLMOnDevice.
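
To make the trade-off between model size and bit-precision concrete, the following minimal sketch estimates weight storage from parameter count and effective BPW. It assumes an idealized, exactly linear relation (parameters × BPW / 8 bytes) and ignores the KV cache, activations, and runtime overhead. The function name `weight_memory_gb` and the two configurations are hypothetical illustrations, not taken from the paper's codebase or results.

```python
def weight_memory_gb(num_params: float, effective_bpw: float) -> float:
    """Approximate weight storage in GB (1 GB = 10^9 bytes).

    Idealized assumption: every weight costs exactly `effective_bpw` bits,
    with no allowance for KV cache, activations, or runtime overhead.
    """
    if num_params <= 0 or effective_bpw <= 0:
        raise ValueError("num_params and effective_bpw must be positive")
    total_bits = num_params * effective_bpw
    return total_bits / 8 / 1e9  # bits -> bytes -> GB


# Illustrative configurations only (not measurements from the paper).
configs = [
    ("7B params at 8.0 BPW", 7e9, 8.0),
    ("14B params at 3.5 BPW", 14e9, 3.5),
]

for label, num_params, bpw in configs:
    print(f"{label}: {weight_memory_gb(num_params, bpw):.3f} GB")
```

Under these assumptions, a 14B-parameter model at 3.5 effective BPW needs less weight storage than a 7B-parameter model at 8 BPW, which is why comparing a larger low-bit model against a smaller higher-precision one is a meaningful deployment question.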
