AI | LG | Feb 17, 2025

Fate: Fast Edge Inference of Mixture-of-Experts Models via Cross-Layer Gate

arXiv:2502.12224v2 | 4 citations | h-index: 17
Originality: Incremental advance
AI Analysis

This work addresses efficient inference for MoE models in resource-constrained edge scenarios, representing an incremental improvement over existing offload-based methods.

The paper tackles the challenge of deploying sparse-activated Mixture-of-Experts (MoE) models on edge devices by proposing Fate, an offloading system that uses cross-layer gate inputs for expert prefetching. Compared to Load on Demand, it reports up to 4.5x faster prefill and up to 4.1x faster decoding while maintaining inference quality.

Large Language Models (LLMs) have demonstrated impressive performance across various tasks, and their application in edge scenarios has attracted significant attention. However, sparse-activated Mixture-of-Experts (MoE) models, which are well suited for edge scenarios, have received relatively little attention due to their high memory demands. Offload-based methods have been proposed to address this challenge, but they face difficulties with expert prediction. Inaccurate expert predictions can result in prolonged inference delays. To promote the application of MoE models in edge scenarios, we propose Fate, an offloading system designed for MoE models to enable efficient inference in resource-constrained environments. The key insight behind Fate is that gate inputs from adjacent layers can be effectively used for expert prefetching, achieving high prediction accuracy without additional GPU overhead. Furthermore, Fate employs a shallow-favoring expert caching strategy that increases the expert hit rate to 99%. Additionally, Fate integrates tailored quantization strategies for cache optimization and IO efficiency. Experimental results show that, compared to Load on Demand and Expert Activation Path-based methods, Fate achieves up to 4.5x and 1.9x speedups in prefill speed and up to 4.1x and 2.2x speedups in decoding speed, respectively, while maintaining inference quality. Moreover, Fate's performance improvements are scalable across different memory budgets.
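
The cross-layer idea in the abstract can be illustrated with a toy sketch. It assumes a linear gate with top-k routing, which is a common MoE design but is not confirmed by this page for Fate specifically. All function names, the toy weights, and the hit-rate helper are hypothetical and are not taken from the Fate code. The sketch feeds the gate input of layer i into the gate of layer i+1 to guess which experts to prefetch, then compares that guess with the experts the next layer actually selects.

```python
from typing import List, Sequence


def gate_scores(hidden: Sequence[float],
                gate_weights: Sequence[Sequence[float]]) -> List[float]:
    """Router logits: one dot product per expert (one weight row per expert)."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in gate_weights]


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest scores, best first; ties go to the lower index."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]


def predict_next_layer_experts(gate_input: Sequence[float],
                               next_gate_weights: Sequence[Sequence[float]],
                               k: int) -> List[int]:
    """Reuse the current layer's gate input with the NEXT layer's gate
    to guess which experts to prefetch (illustrative assumption)."""
    return top_k(gate_scores(gate_input, next_gate_weights), k)


def prefetch_hit_rate(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of the actually routed experts that were prefetched."""
    if not actual:
        return 1.0
    return len(set(predicted) & set(actual)) / len(actual)


# Toy data (hypothetical): 4 experts, hidden size 3, top-2 routing.
next_gate = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 1.0],
]
gate_input_layer_i = [1.0, 0.0, 0.5]      # available before layer i+1 runs
gate_input_layer_next = [0.9, 0.2, 0.6]   # true input, only known later

predicted = predict_next_layer_experts(gate_input_layer_i, next_gate, k=2)
actual = top_k(gate_scores(gate_input_layer_next, next_gate), 2)
hit_rate = prefetch_hit_rate(predicted, actual)
```

In this toy case the two gate inputs are similar, so the predicted and actual expert sets coincide. In a real system the prediction would be used to start loading expert weights from host memory or storage while the current layer is still computing.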

Code Implementations: 1 repo
