LG · Sep 28, 2025

PreScope: Unleashing the Power of Prefetching for Resource-Constrained MoE Inference

arXiv:2509.23638v1 · 4 citations · h-index: 18
Originality: Highly original
AI Analysis

PreScope addresses inference-efficiency problems for users deploying large MoE models on resource-constrained hardware. It represents a strong, specific gain rather than a foundational breakthrough.

The paper tackled memory and PCIe latency bottlenecks in Mixture-of-Experts (MoE) models on commodity hardware by developing PreScope, a prediction-driven expert scheduling system, which achieved 141% higher throughput and 74.6% lower latency than state-of-the-art solutions.

Mixture-of-Experts (MoE) models face memory and PCIe latency bottlenecks when deployed on commodity hardware. Offloading expert weights to CPU memory results in PCIe transfer latency that exceeds GPU computation time several-fold. We present PreScope, a prediction-driven expert scheduling system that addresses three key challenges: inaccurate activation prediction, PCIe bandwidth competition, and cross-device scheduling complexity. Our solution comprises three components: 1) a Learnable Layer-Aware Predictor (LLaPor) that captures layer-specific expert activation patterns; 2) Prefetch-Aware Cross-Layer Scheduling (PreSched), which generates globally optimal plans balancing prefetching costs and loading overhead; 3) an Asynchronous I/O Optimizer (AsyncIO) that decouples I/O from computation, eliminating waiting bubbles. PreScope achieves 141% higher throughput and 74.6% lower latency than state-of-the-art solutions.
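
To make the general idea of prediction-driven prefetching concrete, here is a minimal toy sketch in Python. It is not PreScope's implementation: `load_expert`, `compute_layer`, `run`, the timing constants, and the single-worker "PCIe lane" are all hypothetical stand-ins, and transfers and computation are simulated with `time.sleep`. It only illustrates how issuing the next layer's predicted expert load while the current layer computes can hide transfer time, and how a misprediction falls back to a blocking load.

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Hypothetical constants for the simulation (not taken from the paper).
LOAD_S = 0.02      # simulated PCIe transfer time for one expert
COMPUTE_S = 0.02   # simulated GPU compute time for one layer


def load_expert(layer, expert):
    """Pretend to copy one expert's weights from CPU memory to the GPU."""
    time.sleep(LOAD_S)
    return (layer, expert)


def compute_layer(layer, expert):
    """Pretend to run the selected expert of one layer on the GPU."""
    time.sleep(COMPUTE_S)


def run(routes, predicted, prefetch):
    """Process one token through all layers.

    routes[i]    -- expert that layer i actually activates
    predicted[i] -- expert guessed for layer i before it runs
    Returns (elapsed_seconds, blocking_loads).
    """
    blocking_loads = 0
    start = time.perf_counter()
    # A single worker models one shared transfer lane.
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = None  # (predicted_expert, future) for the upcoming layer
        if prefetch:
            pending = (predicted[0], io.submit(load_expert, 0, predicted[0]))
        for layer, expert in enumerate(routes):
            if pending is not None and pending[0] == expert:
                pending[1].result()          # prediction hit: weights arrive
            else:
                if pending is not None:
                    pending[1].result()      # let the wrong transfer drain
                blocking_loads += 1
                load_expert(layer, expert)   # on-demand load stalls compute
            pending = None
            if prefetch and layer + 1 < len(routes):
                nxt = predicted[layer + 1]
                # Start the next transfer before computing, so they overlap.
                pending = (nxt, io.submit(load_expert, layer + 1, nxt))
            compute_layer(layer, expert)
    return time.perf_counter() - start, blocking_loads


if __name__ == "__main__":
    routes = [3, 1, 4, 1, 5, 2]
    for label, flag in (("on-demand", False), ("prefetch", True)):
        elapsed, stalls = run(routes, routes, prefetch=flag)
        print(f"{label}: {elapsed:.3f}s, blocking loads = {stalls}")
```

In this toy setup a perfect predictor removes every blocking load after the first transfer, while each wrong guess costs one stalled load. This is a simplified illustration of why prediction accuracy and overlap of I/O with computation matter in the setting the abstract describes.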

