LGJun 10, 2024

PowerInfer-2: Fast Large Language Model Inference on a Smartphone

Zhenliang Xue, Yixin Song, Zeyu Mi, Xinrui Zheng, Yubin Xia, Haibo Chen

arXiv:2406.06282v334.1114 citations

Originality Highly original

AI Analysis

This enables real-time, privacy-preserving AI assistance on smartphones, addressing a significant deployment bottleneck for mobile AI applications.

The paper tackles the problem of running large language models (LLMs) on smartphones, which are limited by memory and compute constraints, by introducing PowerInfer-2, a framework that achieves up to a 27.8x speed increase and serves a 47B LLM at 11.68 tokens/s with negligible accuracy loss.

Large language models (LLMs) on smartphones enable real-time AI assistance and privacy-preserving, offline operation. However, resource constraints of smartphones limit current deployments to small language models (SLMs), significantly compromising their capabilities. This paper introduces PowerInfer-2, a smartphone-based framework that enables fast inference for LLMs exceeding the memory capacity. The key insight is decomposing matrix operations into neuron clusters as the basic processing unit, which enables flexible scheduling and efficient I/O-computation pipelining. PowerInfer-2 leverages this neuron-cluster-based design in both computation and storage. For computation, neuron clusters with dense activations are processed on NPU, while sparse clusters use CPU. The storage engine provides a fine-grained pipeline mechanism that coordinates cluster-level computation and I/O operations, enhanced by a segmented neuron cache to reduce I/O activities. PowerInfer-2 achieves up to a 27.8x speed increase compared to state-of-the-art frameworks. PowerInfer-2 is the first system to serve a 47B LLM on a smartphone, achieving 11.68 tokens/s. Notably, these performance improvements preserve model quality with negligible accuracy degradation.

View on arXiv PDF

Similar