AI · Aug 29, 2023

SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget

arXiv:2308.15030v4 · 39 citations · h-index: 46
Originality: Incremental advance
AI Analysis

The paper addresses the high memory usage of deploying MoE models on resource-limited devices. It is an incremental improvement over existing methods such as memory swapping and expert pruning.

The paper tackles the challenge of serving Mixture of Experts (MoE)-based large language models on memory-constrained devices by introducing SwapMoE, a framework with a tunable memory budget. On text summarization with Switch Transformer, it reduces memory consumption from 14.2 GiB to 4.7 GiB and cuts latency by 50%, at the cost of a slight Rouge-2 score drop of 0.041.

Mixture of experts (MoE) is a popular technique to improve the capacity of Large Language Models (LLMs) with conditionally-activated parallel experts. However, serving MoE models on memory-constrained devices is challenging due to the large parameter size. Typical solutions such as memory swapping or expert pruning may lead to significantly higher latency or severe accuracy loss. In this paper, we introduce SwapMoE, a framework for efficient serving of MoE-based large language models with tunable memory budgets. The main idea of SwapMoE is to keep a small dynamic set of important experts, namely Virtual Experts, in the main memory for inference, while seamlessly maintaining how the Virtual Experts map to the actual experts. Experiments have shown that SwapMoE can reduce the memory footprint while maintaining reasonable accuracy. For example, on text summarization tasks with Switch Transformer, SwapMoE can reduce the memory consumption from 14.2 GiB to 4.7 GiB, together with a 50% latency reduction and a slight Rouge-2 score drop of 0.041.
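
The idea of keeping a small, dynamic set of experts resident in memory can be illustrated with a toy sketch. This is not the paper's implementation: the class name `VirtualExpertPool`, the `load_expert` callback, the moving-average importance score, and the decay factor are all assumptions made for illustration, and the abstract does not specify how SwapMoE scores importance or schedules swaps.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence


@dataclass
class VirtualExpertPool:
    """Toy model of a fixed-budget set of in-memory experts (hypothetical)."""

    num_experts: int                      # total experts in the MoE layer
    budget: int                           # how many experts may stay in memory
    load_expert: Callable[[int], object]  # hypothetical loader from slow storage
    decay: float = 0.9                    # assumed smoothing factor for importance
    resident: Dict[int, object] = field(default_factory=dict, init=False)
    importance: List[float] = field(default_factory=list, init=False)

    def __post_init__(self) -> None:
        if not 0 < self.budget <= self.num_experts:
            raise ValueError("budget must be in 1..num_experts")
        self.importance = [0.0] * self.num_experts
        # Start with an arbitrary resident set: the first `budget` experts.
        for expert_id in range(self.budget):
            self.resident[expert_id] = self.load_expert(expert_id)

    def route(self, scores: Sequence[float]) -> int:
        """Pick the best-scoring expert among those currently in memory.

        Importance is tracked for every expert, resident or not, so that a
        later refresh can bring in experts the router keeps asking for.
        """
        for expert_id, score in enumerate(scores):
            old = self.importance[expert_id]
            self.importance[expert_id] = self.decay * old + (1 - self.decay) * score
        return max(self.resident, key=lambda expert_id: scores[expert_id])

    def refresh(self) -> None:
        """Make the resident set equal to the `budget` most important experts."""
        ranked = sorted(
            range(self.num_experts),
            key=lambda expert_id: self.importance[expert_id],
            reverse=True,
        )
        wanted = set(ranked[: self.budget])
        for expert_id in list(self.resident):
            if expert_id not in wanted:
                del self.resident[expert_id]  # evict to free memory
        for expert_id in wanted:
            if expert_id not in self.resident:
                self.resident[expert_id] = self.load_expert(expert_id)
```

In this sketch, `route` never waits on storage because it only chooses among resident experts, and `refresh` is the only place where loading happens. A real system would presumably run the equivalent of `refresh` off the inference critical path. The `budget` parameter plays the role of the tunable memory budget in this toy model.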

