NI AI LG

Jan 16, 2025

MoE$^2$: Optimizing Collaborative Inference for Edge Large Language Models

Lyudong Jin, Yanning Zhang, Yanhan Li, Shurong Wang, Howard H. Yang, Jian Wu, Meng Zhang

arXiv:2501.09410v13.34 citations

h-index: 2

Originality: Incremental advance

AI Analysis

This work addresses the challenge of efficient inference for edge LLMs, enabling cost-effectiveness and reduced latency for emerging applications, but it is incremental as it builds on existing MoE concepts.

The paper tackles collaborative inference for edge large language models (LLMs) by introducing the Mixture-of-Edge-Experts (MoE$^2$) framework, which formulates joint gating and expert selection to improve inference performance under energy and latency constraints. The reported results show that it achieves optimal trade-offs across delay and energy budgets and outperforms baselines under various resource constraints.

Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing tasks. Exploiting the heterogeneous capabilities of edge LLMs is crucial for diverse emerging applications, as it enables greater cost-effectiveness and reduced latency. In this work, we introduce Mixture-of-Edge-Experts (MoE$^2$), a novel collaborative inference framework for edge LLMs. We formulate the joint gating and expert selection problem to optimize inference performance under energy and latency constraints. Unlike conventional MoE problems, LLM expert selection is significantly more challenging due to its combinatorial nature and the heterogeneity of edge LLMs across various attributes. To this end, we propose a two-level expert selection mechanism through which we uncover an optimality-preserving property of gating parameters across expert selections. This property enables the decomposition of the training and selection processes, significantly reducing complexity. Furthermore, we leverage the objective's monotonicity and design a discrete monotonic optimization algorithm for optimal expert selection. We implement edge servers with NVIDIA Jetson AGX Orins and NVIDIA RTX 4090 GPUs, and perform extensive experiments. Our results validate the performance improvements across various LLMs and show that our MoE$^2$ method can achieve optimal trade-offs among different delay and energy budgets and outperforms baselines under various system resource constraints.
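To make the expert selection problem concrete, here is a minimal sketch of choosing a subset of edge LLMs under an energy budget and a latency budget. It is a toy exhaustive search, not the paper's discrete monotonic optimization algorithm. Every expert name and number is hypothetical, and so are the modeling choices: energy is assumed to add up across experts, latency is assumed to be that of the slowest expert, and the utility is an assumed monotone stand-in for inference performance.

```python
from itertools import combinations

# Hypothetical expert profiles: (quality in [0, 1], energy in joules, latency in seconds).
EXPERTS = {
    "small": (0.55, 2.0, 0.10),
    "medium": (0.70, 5.0, 0.25),
    "large": (0.85, 12.0, 0.60),
}


def utility(subset):
    """Assumed monotone objective: 1 - prod(1 - quality); adding an expert never lowers it."""
    miss = 1.0
    for name in subset:
        miss *= 1.0 - EXPERTS[name][0]
    return 1.0 - miss


def feasible(subset, energy_budget, latency_budget):
    """Assumption: energy is additive; experts run in parallel, so latency is the maximum."""
    energy = sum(EXPERTS[name][1] for name in subset)
    latency = max((EXPERTS[name][2] for name in subset), default=0.0)
    return energy <= energy_budget and latency <= latency_budget


def select_experts(energy_budget, latency_budget):
    """Enumerate every non-empty subset and keep the feasible one with the highest utility."""
    names = sorted(EXPERTS)
    best, best_utility = (), 0.0
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            if feasible(subset, energy_budget, latency_budget):
                value = utility(subset)
                if value > best_utility:
                    best, best_utility = subset, value
    return best, best_utility


if __name__ == "__main__":
    print(select_experts(energy_budget=8.0, latency_budget=0.3))
```

The search visits all 2^n - 1 non-empty subsets, which is the combinatorial cost the abstract refers to. An algorithm that exploits the objective's monotonicity, as the abstract describes, would aim to avoid enumerating most of them.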
