DC LG | Dec 16, 2024

DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference

Yujie Zhang, Shivam Aggarwal, Tulika Mitra

arXiv:2501.10375v29.218 citations | h-index: 7 | Has Code

Originality: Highly original

AI Analysis

DAOP addresses the deployment challenges of MoE models on memory-constrained devices, reporting substantial inference speedups while maintaining accuracy.

The paper tackles inefficient MoE inference on memory-constrained devices by proposing DAOP, a data-aware offloading and predictive pre-calculation engine. It reports speedups of up to 8.20x over traditional expert caching and prefetching methods and 1.35x over offloading techniques while maintaining accuracy.

Mixture-of-Experts (MoE) models, though highly effective for various machine learning tasks, face significant deployment challenges on memory-constrained devices. While GPUs offer fast inference, their limited memory compared to CPUs means not all experts can be stored on the GPU simultaneously, necessitating frequent, costly data transfers from CPU memory, often negating GPU speed advantages. To address this, we present DAOP, an on-device MoE inference engine to optimize parallel GPU-CPU execution. DAOP dynamically allocates experts between CPU and GPU based on per-sequence activation patterns, and selectively pre-calculates predicted experts on CPUs to minimize transfer latency. This approach enables efficient resource utilization across various expert cache ratios while maintaining model accuracy through a novel graceful degradation mechanism. Comprehensive evaluations across various datasets show that DAOP outperforms traditional expert caching and prefetching methods by up to 8.20x and offloading techniques by 1.35x while maintaining accuracy.
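The per-sequence allocation idea in the abstract can be illustrated with a small sketch. This is not DAOP's actual algorithm, which the abstract does not specify. It assumes a simple heuristic: count how often each expert was routed to within one sequence, then keep the most frequently activated experts on the GPU up to the expert cache ratio. The function name `allocate_experts` and its parameters are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def allocate_experts(
    routed_experts: Iterable[int],
    num_experts: int,
    cache_ratio: float,
) -> Tuple[Set[int], Set[int]]:
    """Split experts into GPU-resident and CPU-resident sets for one sequence.

    Hypothetical heuristic (an assumption, not the paper's method):
    the most frequently activated experts go to the GPU.
    """
    if not 0.0 <= cache_ratio <= 1.0:
        raise ValueError("cache_ratio must be in [0, 1]")

    # Activation count per expert id, observed for this sequence only.
    counts = Counter(routed_experts)

    # Number of experts that fit on the GPU at this cache ratio.
    gpu_slots = int(num_experts * cache_ratio)

    # Rank by activation count (descending); break ties by lower expert id.
    ranked = sorted(range(num_experts), key=lambda e: (-counts[e], e))

    gpu_experts = set(ranked[:gpu_slots])
    cpu_experts = set(ranked[gpu_slots:])
    return gpu_experts, cpu_experts


# Example: 8 experts, room for 25% of them on the GPU.
gpu, cpu = allocate_experts([0, 3, 3, 5, 3, 5, 1], num_experts=8, cache_ratio=0.25)
```

In this toy example, experts 3 and 5 are activated most often, so they occupy the two GPU slots and the remaining six experts stay in CPU memory.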
