LG AI DC IT | Nov 11, 2024

WDMoE: Wireless Distributed Mixture of Experts for Large Language Models

Nan Xue, Yaping Sun, Zhiyong Chen, Meixia Tao, Xiaodong Xu, Liang Qian, Shuguang Cui, Wenjun Zhang, Ping Zhang

arXiv:2411.06681v115.021 citations | h-index: 8 | IEEE Trans Wirel Commun

Originality: Incremental advance

AI Analysis

This work addresses efficient LLM deployment in resource-constrained wireless edge environments. It is an incremental advance that adapts existing MoE methods to a new domain.

The paper proposes a wireless distributed Mixture of Experts (WDMoE) architecture for deploying large language models (LLMs) in wireless networks. The architecture splits model components between base stations and mobile devices. Simulations and hardware experiments show significantly reduced latency without compromising model performance.

Large Language Models (LLMs) have achieved significant success in various natural language processing tasks, but the role of wireless networks in supporting LLMs has not been thoroughly explored. In this paper, we propose a wireless distributed Mixture of Experts (WDMoE) architecture to enable collaborative deployment of LLMs across edge servers at the base station (BS) and mobile devices in wireless networks. Specifically, we decompose the MoE layer in LLMs by placing the gating network and the preceding neural network layer at the BS, while distributing the expert networks among the devices. This deployment leverages the parallel inference capabilities of expert networks on mobile devices, effectively utilizing the limited computing and caching resources of these devices. Accordingly, we develop a performance metric for WDMoE-based LLMs, which accounts for both model capability and latency. To minimize the latency while maintaining accuracy, we jointly optimize expert selection and bandwidth allocation based on the performance metric. Moreover, we build a hardware testbed using NVIDIA Jetson kits to validate the effectiveness of WDMoE. Both theoretical simulations and practical hardware experiments demonstrate that the proposed method can significantly reduce the latency without compromising LLM performance.
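The split described in the abstract (gating at the BS, experts running in parallel on devices, bandwidth shared among the selected devices) can be illustrated with a small sketch. This is not the paper's formulation: the rate model (bandwidth share times total bandwidth times a per-device spectral efficiency), the assumption that uplink and downlink carry the same payload, the bisection-based min-max bandwidth split, and every function and parameter name below are hypothetical choices made only for illustration. The sketch also picks experts by plain top-k gating first and allocates bandwidth second, whereas the paper optimizes the two jointly.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(logits, k):
    """Gating step (run at the BS): pick the k highest-scoring experts.

    Returns (expert_index, renormalized_weight) pairs.
    """
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]


def device_latency(payload_bits, share, total_bw_hz, eff_bps_per_hz, compute_s):
    """Assumed latency model: send tokens down, compute, send results back."""
    rate_bps = share * total_bw_hz * eff_bps_per_hz
    return 2 * payload_bits / rate_bps + compute_s


def allocate_bandwidth(expert_ids, payload_bits, total_bw_hz, eff, compute_s, iters=50):
    """Min-max bandwidth split via bisection on the target layer latency T.

    Experts run in parallel, so the layer waits for the slowest device.
    For a given T, device i needs share 2*bits / (B * eff_i * (T - c_i)).
    The smallest feasible T is the one where the shares sum to 1.
    """
    def shares_for(t):
        return {
            i: 2 * payload_bits / (total_bw_hz * eff[i] * (t - compute_s[i]))
            for i in expert_ids
        }

    lo = max(compute_s[i] for i in expert_ids)  # unreachable lower bound
    hi = lo + 1.0
    while sum(shares_for(hi).values()) > 1.0:   # grow until feasible
        hi = lo + 2 * (hi - lo)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if sum(shares_for(mid).values()) <= 1.0:
            hi = mid
        else:
            lo = mid
    return hi, shares_for(hi)


# Hypothetical numbers: 4 devices, one expert each, 20 MHz shared bandwidth.
gate_logits = [2.0, 0.5, 1.0, -1.0]
eff = [2.0, 4.0, 1.0, 3.0]              # bits/s/Hz per device (assumed)
compute_s = [0.010, 0.020, 0.015, 0.012]  # per-expert compute time in seconds
payload_bits = 4096 * 16                # one fp16 hidden vector of width 4096
total_bw_hz = 20e6

selected = top_k_gate(gate_logits, k=2)
ids = [i for i, _ in selected]
layer_t, shares = allocate_bandwidth(ids, payload_bits, total_bw_hz, eff, compute_s)
print("selected experts:", selected)
print("bandwidth shares:", shares)
print("layer latency (s):", layer_t)
```

Under this assumed model, the min-max split gives more bandwidth to the device with the weaker link or slower compute, so all selected devices finish at roughly the same time; an equal split would leave the layer waiting on the slowest device.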
