LG | AI | May 13, 2025

Model-Distributed Inference for Large Language Models at the Edge

arXiv:2505.18164v1 | 5 citations | h-index: 20 | LANMAN
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of running state-of-the-art LLMs on resource-constrained edge hardware. The advance is incremental because it builds on existing distributed computing techniques.

The paper tackles the problem of deploying large language models on low-power edge devices by introducing a model-distributed inference framework that partitions a model across devices. This enables inference on models that exceed the memory of any single device and increases throughput as more devices are added.

We introduce Model-Distributed Inference for Large-Language Models (MDI-LLM), a novel framework designed to facilitate the deployment of state-of-the-art large-language models (LLMs) across low-power devices at the edge. This is accomplished by dividing the model into multiple partitions, which are then assigned to different devices/nodes within the network. These nodes exchange intermediate activation vectors via device-to-device links, enabling collaborative computation. To enhance the efficiency of this process, we propose the "recurrent pipeline parallelism" technique, which reduces idle time on each device and facilitates parallel inference during the generation of multiple text sequences. By leveraging the combined computational resources of multiple edge devices, MDI-LLM enables the deployment of LLMs that exceed the memory capacity of individual devices, making it possible to perform inference on low-cost hardware. Furthermore, as the number of participating devices increases, MDI-LLM boosts token generation throughput and reduces memory consumption per device.
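The partitioning and pipelining ideas in the abstract can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the MDI-LLM implementation: the function names (`partition_layers`, `ring_schedule`, `utilization`) are hypothetical, layers are split into contiguous blocks of near-equal size, every node takes exactly one time step per forward pass, and communication latency is ignored. Nodes are arranged in a ring so that the last node's output feeds back to the first node for the next token.

```python
from collections import deque


def partition_layers(n_layers, n_nodes):
    """Split layer indices 0..n_layers-1 into contiguous, near-equal blocks.

    Assumption for illustration: each node holds one contiguous block.
    """
    base, extra = divmod(n_layers, n_nodes)
    blocks, start = [], 0
    for node in range(n_nodes):
        size = base + (1 if node < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def ring_schedule(n_nodes, n_samples, n_steps):
    """Simulate which sample each node processes at every time step.

    All samples start queued at node 0. After a node processes a sample,
    the sample moves to the next node in the ring; leaving the last node
    means one token was generated and the sample re-enters node 0.
    Returns a list of rows, one per step; None marks an idle node.
    """
    queues = [deque() for _ in range(n_nodes)]
    queues[0].extend(range(n_samples))
    timeline = []
    for _ in range(n_steps):
        # Every node pops its input first, so a sample advances one hop per step.
        active = [q.popleft() if q else None for q in queues]
        for node, sample in enumerate(active):
            if sample is not None:
                queues[(node + 1) % n_nodes].append(sample)
        timeline.append(active)
    return timeline


def utilization(timeline):
    """Fraction of (step, node) slots in which a node was busy."""
    slots = [s for row in timeline for s in row]
    return sum(s is not None for s in slots) / len(slots)
```

In this toy model, generating a single sequence on three nodes keeps only one node busy per step, while generating three sequences at once keeps every node busy after a short warm-up. That is the intuition behind reducing idle time by generating multiple text sequences in parallel.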

