LG · PF · Jan 25, 2024

MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache

arXiv:2401.14361v3 · 17 citations · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of running large MoE-based LLMs efficiently on personal machines, an incremental advance that mainly benefits users with limited hardware resources.

This paper tackles efficient inference for Mixture-of-Experts (MoE) models on personal machines with limited GPU memory. It introduces a sparsity-aware expert cache that exploits activation sparsity in single-user, batch-size-one settings, yielding 3.1-16.7x per-token latency improvements over state-of-the-art systems such as vLLM and DeepSpeed across MoE models such as DeepSeek and Mixtral.

This paper presents MoE-Infinity, an efficient MoE inference system designed for personal machines with limited GPU memory capacity. The key idea for MoE-Infinity is that on personal machines, which are often single-user environments, MoE-based LLMs typically operate with a batch size of one. In this setting, MoE models exhibit a high degree of activation sparsity, meaning a small number of experts are frequently reused in generating tokens during the decode phase. Leveraging this idea, we design a sparsity-aware expert cache, which can trace the sparse activation of experts during inference and carefully select the trace that represents the sparsity pattern. By analyzing these selected traces, MoE-Infinity guides the replacement and prefetching of the expert cache, providing 3.1-16.7x per-token latency improvements over numerous state-of-the-art systems, including vLLM, Ollama, DeepSpeed and BrainStorm across various MoE models (DeepSeek and Mixtral) when handling different LLM tasks. MoE-Infinity's source code is publicly available at https://github.com/EfficientMoE/MoE-Infinity
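To make the caching idea in the abstract concrete, here is a minimal sketch of an expert cache that records activation counts and uses them to pick an eviction victim. It is not MoE-Infinity's implementation: the `ExpertCache` class, the least-activated eviction rule, and the routing trace are all assumptions chosen for illustration, and prefetching and trace selection are left out entirely.

```python
from collections import Counter


class ExpertCache:
    """Toy GPU-resident expert cache with activation-count-based eviction.

    Hypothetical illustration only; not the MoE-Infinity API.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = set()    # expert ids currently held in the cache
        self.trace = Counter()   # activation counts observed so far
        self.hits = 0
        self.misses = 0

    def access(self, expert_id):
        """Record one expert activation; return True on a cache hit."""
        self.trace[expert_id] += 1
        if expert_id in self.resident:
            self.hits += 1
            return True
        self.misses += 1
        if len(self.resident) >= self.capacity:
            # Evict the resident expert with the fewest recorded activations
            # (ties broken by the lower expert id).
            victim = min(self.resident, key=lambda e: (self.trace[e], e))
            self.resident.remove(victim)
        self.resident.add(expert_id)
        return False


# Made-up routing trace for one layer while decoding at batch size one:
# experts 0 and 1 are reused often, the others appear only once.
routing = [0, 1, 0, 1, 2, 0, 1, 3, 0, 1, 4, 0, 1]
cache = ExpertCache(capacity=3)
for expert in routing:
    cache.access(expert)
print(cache.hits, cache.misses, sorted(cache.resident))
```

In this toy trace the frequently reused experts stay resident while the rarely used ones are swapped through the remaining slot, which is the kind of reuse pattern the abstract describes for the decode phase.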

Code Implementations · 2 repos