CL LG Oct 22, 2025

MoE-Prism: Disentangling Monolithic Experts for Elastic MoE Services via Model-System Co-Designs

Xinfeng Xia, Jiacheng Liu, Xiaofeng Hou, Peng Tang, Mingxuan Zhang, Wenfeng Wang, Chao Li

arXiv:2510.19366v1 4.91 citations h-index: 10

Originality: Incremental advance

AI Analysis

This work addresses resource over-provisioning and inflexible cost-quality trade-offs faced by AI service providers, offering incremental improvements in elasticity and efficiency.

The paper tackles the inflexibility of Mixture-of-Experts (MoE) models, whose top-k routing over monolithic experts creates a 'quality cliff' that leaves only a few coarse operating points. It introduces MoE-Prism, a model-system co-design that refactors monolithic experts into fine-grained sub-experts and serves them elastically. The reported results are over 4 times more operating points, up to 19.9% higher throughput under latency constraints, and up to 10.36% lower latency under resource limits.

Mixture-of-Experts (MoE) models, the state-of-the-art in large-scale AI, achieve high quality by sparsely activating parameters. However, their reliance on routing between a few monolithic experts via a top-k mechanism creates a "quality cliff", offering only a few coarse-grained operating points. This inflexibility forces a difficult trade-off between cost and quality, preventing adaptation to diverse Service Level Objectives (SLOs) and leading to significant resource over-provisioning. This paper introduces MoE-Prism, a model-system co-design that transforms rigid MoE models into elastic services. Our methodology is divided into two phases. First, an Offline Refactoring Engine systematically deconstructs monolithic experts into fine-grained "sub-experts." This engine employs a partitioning optimization solver that uses a metaheuristic-based approach to group neurons, preserving functional locality without requiring retraining. Second, an Online Scheduling Engine leverages this new elasticity through QoS-aware scheduling. It implements specialized policies to solve complex system problems, including maximizing throughput in cloud deployments and managing latency-optimized offloading for memory-constrained devices. Our evaluation across three different MoE models shows that MoE-Prism provides over 4 times more distinct, stable operating points than the baseline. This allows an AI service to dynamically improve throughput by up to 19.9% under a strict latency budget or reduce latency by up to 10.36% under limited resources. MoE-Prism provides the critical "control knob" to bridge the model-system gap, enabling the next generation of adaptive, efficient, and QoS-aware AI services.
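To make the idea of splitting an expert without retraining more concrete, here is a minimal sketch of why a feed-forward expert can be divided along its hidden neurons. It is not the paper's implementation: the ReLU feed-forward expert, the toy sizes, and the contiguous `partition` helper (a hypothetical stand-in for the metaheuristic partitioning solver) are all assumptions made for illustration. The sketch only checks one general property: because each hidden neuron contributes additively to the output, the outputs of the sub-experts sum to the output of the original expert.

```python
import random


def relu(v):
    return v if v > 0.0 else 0.0


def expert_forward(x, w_in, w_out, neurons):
    """Output of a ReLU feed-forward expert restricted to the given hidden neurons.

    w_in[j] holds the input weights of hidden neuron j (length d_model);
    w_out[j] holds its output weights (length d_model).
    """
    d_model = len(x)
    y = [0.0] * d_model
    for j in neurons:
        h = relu(sum(w * xi for w, xi in zip(w_in[j], x)))
        for i in range(d_model):
            y[i] += h * w_out[j][i]
    return y


def partition(num_neurons, num_groups):
    """Hypothetical stand-in for the partitioning solver: contiguous equal chunks.

    Assumes num_neurons is divisible by num_groups.
    """
    size = num_neurons // num_groups
    return [list(range(g * size, (g + 1) * size)) for g in range(num_groups)]


# Toy sizes chosen for illustration only.
rng = random.Random(0)
d_model, hidden, num_groups = 4, 8, 4
w_in = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(hidden)]
w_out = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(hidden)]
x = [rng.uniform(-1, 1) for _ in range(d_model)]

# Monolithic expert: all hidden neurons at once.
full = expert_forward(x, w_in, w_out, range(hidden))

# Sub-experts: one forward pass per neuron group, then sum the outputs.
groups = partition(hidden, num_groups)
parts = [expert_forward(x, w_in, w_out, g) for g in groups]
recombined = [sum(p[i] for p in parts) for i in range(d_model)]

# The two results agree up to floating-point rounding.
max_err = max(abs(a - b) for a, b in zip(full, recombined))
print(f"max abs difference: {max_err:.2e}")
```

Activating only some of the groups would give intermediate compute levels between "expert off" and "expert fully on". Which neurons are grouped together then determines how much quality each partial activation retains, which is presumably where a careful partitioning matters.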
