LG | Sep 28, 2025

PreScope: Unleashing the Power of Prefetching for Resource-Constrained MoE Inference

Enda Yu, Zhaoning Zhang, Dezun Dong, Yongwei Wu, Xiangke Liao

arXiv:2509.23638v1 | 14.44 citations | h-index: 18

Originality: Highly original

AI Analysis

This work addresses inference efficiency for users deploying large MoE models on resource-constrained hardware. It represents a strong, specific gain rather than a foundational breakthrough.

The paper tackles memory and PCIe latency bottlenecks in Mixture-of-Experts (MoE) models on commodity hardware with PreScope, a prediction-driven expert scheduling system that achieves 141% higher throughput and 74.6% lower latency than state-of-the-art solutions.

Mixture-of-Experts (MoE) models face memory and PCIe latency bottlenecks when deployed on commodity hardware. Offloading expert weights to CPU memory incurs PCIe transfer latency that is several times longer than the GPU computation itself. We present PreScope, a prediction-driven expert scheduling system that addresses three key challenges: inaccurate activation prediction, PCIe bandwidth competition, and cross-device scheduling complexity. Our solution includes: 1) a Learnable Layer-Aware Predictor (LLaPor) that captures layer-specific expert activation patterns; 2) Prefetch-Aware Cross-Layer Scheduling (PreSched), which generates globally optimal plans that balance prefetching costs against loading overhead; 3) an Asynchronous I/O Optimizer (AsyncIO) that decouples I/O from computation, eliminating waiting bubbles. PreScope achieves 141% higher throughput and 74.6% lower latency than state-of-the-art solutions.
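
The general idea of overlapping predicted expert loads with computation can be illustrated with a minimal Python sketch. This is not the PreScope implementation: `load_expert`, `predict_experts`, and `run_layers` are hypothetical names, the predictor is a fixed toy rule rather than LLaPor, and a thread pool stands in for asynchronous PCIe transfers.

```python
from concurrent.futures import ThreadPoolExecutor


def load_expert(layer, expert_id):
    """Hypothetical stand-in for a CPU-to-GPU expert weight copy."""
    return f"weights[{layer}][{expert_id}]"


def predict_experts(layer, num_experts=8, top_k=2):
    """Toy predictor (assumption): a fixed rule, not a learned model."""
    return [(layer + i) % num_experts for i in range(top_k)]


def prefetch(io_pool, pending, layer):
    """Submit background loads for the experts predicted for `layer`."""
    for expert_id in predict_experts(layer):
        pending[(layer, expert_id)] = io_pool.submit(load_expert, layer, expert_id)


def run_layers(num_layers, actual_experts):
    """Walk the layers, prefetching the next layer's predicted experts
    in the background while the current layer is processed."""
    stats = {"hits": 0, "misses": 0}
    pending = {}  # (layer, expert_id) -> Future for an in-flight load
    with ThreadPoolExecutor(max_workers=2) as io_pool:
        prefetch(io_pool, pending, 0)
        for layer in range(num_layers):
            if layer + 1 < num_layers:
                prefetch(io_pool, pending, layer + 1)
            for expert_id in actual_experts[layer]:
                key = (layer, expert_id)
                if key in pending:
                    pending.pop(key).result()  # prefetched: wait only if still in flight
                    stats["hits"] += 1
                else:
                    load_expert(layer, expert_id)  # mispredicted: blocking load
                    stats["misses"] += 1
    return stats


if __name__ == "__main__":
    print(run_layers(3, [[0, 1], [1, 2], [2, 5]]))
```

In this toy run, only expert 5 in the last layer is mispredicted and falls back to a blocking load. The sketch omits bandwidth contention and cross-layer planning, which the abstract attributes to PreSched.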
