LG DC | Nov 3, 2024

HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference

Peng Tang, Jiacheng Liu, Xiaofeng Hou, Yifei Pu, Jing Wang, Pheng-Ann Heng, Chao Li, Minyi Guo

arXiv:2411.01433v2 | 25.739 citations | h-index: 13

Originality: Highly original

AI Analysis

This work addresses the problem of efficient MoE inference for edge computing, offering a significant performance improvement over existing methods.

The paper tackles the challenge of deploying Mixture-of-Experts (MoE) large language models on memory-constrained edge devices by introducing HOBBIT, a mixed precision expert offloading system that achieves up to a 9.93x speedup in decoding compared to state-of-the-art offloading systems while preserving model accuracy.

The Mixture-of-Experts (MoE) architecture has demonstrated significant advantages in the era of Large Language Models (LLMs), offering enhanced capabilities with reduced inference costs. However, deploying MoE-based LLMs on memory-constrained edge devices remains challenging due to their substantial memory requirements. While existing expert-offloading methods alleviate the memory requirements, they often incur significant expert-loading costs or compromise model accuracy. We present HOBBIT, a mixed precision expert offloading system to enable flexible and efficient MoE inference. Our key insight is that dynamically replacing less critical cache-miss experts with low precision versions can substantially reduce expert-loading latency while preserving model accuracy. HOBBIT introduces three innovative techniques that map the natural hierarchy of MoE computation: (1) a token-level dynamic expert loading mechanism, (2) a layer-level adaptive expert prefetching technique, and (3) a sequence-level multidimensional expert caching policy. These innovations fully leverage the benefits of mixed-precision expert inference. By implementing HOBBIT on top of the renowned LLM inference framework Llama.cpp, we evaluate its performance across different edge devices with representative MoE models. The results demonstrate that HOBBIT achieves up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
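The key insight above, replacing less critical cache-miss experts with low precision versions, can be illustrated with a minimal Python sketch. This is not HOBBIT's implementation: the abstract does not say how criticality is measured, so the sketch assumes the router's gate weight serves as the importance signal and that a fixed cutoff separates critical from less critical experts. The names `ExpertCache`, `plan_loads`, and `threshold` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ExpertCache:
    """Toy model of the experts currently resident in fast memory (hypothetical)."""
    resident: set = field(default_factory=set)

    def plan_loads(self, routed, threshold=0.3):
        """Decide, per routed expert, what to do for the current token.

        routed:    list of (expert_id, gate_weight) pairs picked by the router.
        threshold: ASSUMED cutoff; a cache-miss expert whose gate weight falls
                   below it is treated as less critical and fetched in low
                   precision. The paper's actual criterion may differ.
        """
        plan = {}
        for expert_id, gate_weight in routed:
            if expert_id in self.resident:
                plan[expert_id] = "cached"               # cache hit: nothing to load
            elif gate_weight >= threshold:
                plan[expert_id] = "load_high_precision"  # critical cache miss
            else:
                plan[expert_id] = "load_low_precision"   # less critical cache miss
        return plan


cache = ExpertCache(resident={0, 3})
plan = cache.plan_loads([(3, 0.55), (5, 0.35), (7, 0.10)])
print(plan)
# {3: 'cached', 5: 'load_high_precision', 7: 'load_low_precision'}
```

In this toy setup only expert 7 is swapped for a low precision copy. The intended trade-off is that a smaller transfer shortens loading latency, while the expert's small gate weight limits how much the reduced precision affects the layer output.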
