LG · Apr 3

FluxMoE: Decoupling Expert Residency for High-Performance MoE Serving

arXiv:2604.02715 · 66.4 · h-index: 1
Predicted impact: top 39% in LG · last 90 days
Originality: Incremental advance
AI Analysis

The work addresses a performance bottleneck in serving large MoE models, enabling higher throughput under memory constraints, though it is an incremental improvement on existing inference systems.

The paper tackles an inefficiency in Mixture-of-Experts (MoE) inference: idle expert weights compete with the key-value (KV) cache for GPU memory. It introduces FluxMoE, a system that decouples expert weights from persistent GPU residency and streams them on demand, achieving up to 3.0× throughput gains over vLLM in memory-intensive regimes.

Mixture-of-Experts (MoE) models have become a dominant paradigm for scaling large language models, but their rapidly growing parameter sizes introduce a fundamental inefficiency during inference: most expert weights remain idle in GPU memory while competing with performance-critical runtime state such as the key-value (KV) cache. Since KV cache capacity directly determines serving throughput, this mismatch leads to underutilized memory and degraded performance. In this paper, we present FluxMoE, a new MoE inference system that decouples expert parameters from persistent GPU residency. FluxMoE introduces an expert paging abstraction that treats expert weights as streamed, transient resources, materializing them on demand and evicting them immediately after use, allowing GPU memory to be preferentially allocated to throughput-critical runtime state. We implement FluxMoE atop vLLM to enable efficient MoE inference under severe memory constraints. Experimental results demonstrate that FluxMoE achieves up to 3.0× throughput gains over vLLM in memory-intensive regimes, without compromising model fidelity.
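
As a rough illustration of the expert paging idea described in the abstract, the Python sketch below models experts that are materialized on demand and evicted immediately after use, and compares the memory left over for the KV cache against keeping every expert resident. The class `ExpertPager`, its methods, the helper `kv_budget`, and the byte-string "weights" are hypothetical stand-ins chosen for this example; they are not FluxMoE's or vLLM's API, and the toy ignores transfer latency entirely.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ExpertPager:
    """Toy model of on-demand expert residency (hypothetical, not FluxMoE's API)."""

    host_weights: dict[int, bytes]                 # expert id -> weights kept off-GPU
    resident: dict[int, bytes] = field(default_factory=dict)  # simulated GPU memory
    peak_resident_bytes: int = 0

    def materialize(self, expert_id: int) -> bytes:
        """Copy one expert into the simulated GPU pool and track peak usage."""
        weights = self.host_weights[expert_id]
        self.resident[expert_id] = weights
        current = sum(len(w) for w in self.resident.values())
        self.peak_resident_bytes = max(self.peak_resident_bytes, current)
        return weights

    def evict(self, expert_id: int) -> None:
        """Drop the expert from the simulated GPU pool."""
        self.resident.pop(expert_id, None)

    def run_layer(self, routed_ids: list[int], compute: Callable[[int, bytes], object]) -> list:
        """Run only the routed experts, evicting each one right after it is used."""
        outputs = []
        for expert_id in routed_ids:
            weights = self.materialize(expert_id)
            try:
                outputs.append(compute(expert_id, weights))
            finally:
                self.evict(expert_id)
        return outputs


def kv_budget(total_gpu_bytes: int, resident_expert_bytes: int) -> int:
    """Bytes left for runtime state (e.g. the KV cache) after expert weights."""
    return total_gpu_bytes - resident_expert_bytes


# Assumed toy setup: 8 experts of 1 KiB each on a 16 KiB "GPU".
host = {i: bytes(1024) for i in range(8)}
pager = ExpertPager(host)
outputs = pager.run_layer([2, 5], lambda eid, w: (eid, len(w)))

all_resident_budget = kv_budget(16 * 1024, sum(len(w) for w in host.values()))
paged_budget = kv_budget(16 * 1024, pager.peak_resident_bytes)
```

In this toy setup only one expert is ever resident at a time, so the paged configuration leaves more of the simulated memory for the KV cache than the all-resident baseline. A real system would presumably also need to hide the cost of streaming weights, which this sketch does not attempt to model.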

