LG · AI · DC · Apr 28

RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts

arXiv:2604.26039 · 79.6
Predicted impact: top 16% in LG (last 90 days) · Originality: highly original
AI Analysis

For production MoE inference systems, RaMP offers a practical, kernel-agnostic dispatch framework that recovers throughput lost to static, batch-size-only dispatch policies.

RaMP achieves a 1.22x kernel speedup over static dispatch and a 1.30x end-to-end speedup in vLLM serving (over Triton) by selecting MoE kernel configurations at runtime from the expert routing distribution. Batch-size-only dispatch leaves 10-70% of kernel throughput unrealized; RaMP's selection reaches 0.93% mean regret versus exhaustive search.

The optimal kernel configuration for Mixture-of-Experts (MoE) inference depends on both batch size and the expert routing distribution, yet production systems dispatch from batch size alone, leaving 10-70% of kernel throughput unrealized. We present RaMP, a routing-aware dispatch framework. A performance-region analysis derives, from hardware constants alone, when each optimization helps, correctly predicting all 8 tested architectures, including 3 unseen. A four-parameter wave cost model selects the fastest configuration from the runtime expert histogram, achieving 0.93% mean regret versus exhaustive search, fitted from just 10-24 minutes of one-time profiling per model. Because the model depends only on CTA grid geometry, it is kernel-agnostic: applied to Alpha-MoE, it delivers 1.14x with no source modification. Paired with a co-designed CuTe DSL kernel exposing 134-268 polymorphic configurations, RaMP delivers 1.22x kernel speedup over static dispatch and 1.30x end-to-end speedup in vLLM serving over Triton, 1.41x over DeepGEMM, and 1.13x over FlashInfer CUTLASS.
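To make the dispatch idea concrete, here is a minimal sketch of histogram-driven configuration selection. It is not the paper's four-parameter wave cost model, whose terms are not given in the abstract. It assumes a toy model in which cost is the number of CTA waves times a per-configuration wave time. The names `Config`, `tile_m`, `wave_time_us`, `n_tiles`, and `num_sms`, and all numeric values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class Config:
    """Hypothetical kernel configuration; fields are illustrative assumptions."""
    name: str
    tile_m: int          # tokens (rows) processed per CTA
    wave_time_us: float  # assumed time for one full wave of CTAs


def num_ctas(histogram, tile_m, n_tiles):
    """CTA grid size: each expert needs ceil(tokens / tile_m) row tiles,
    each replicated across n_tiles column tiles."""
    return sum(ceil(tokens / tile_m) for tokens in histogram) * n_tiles


def predicted_cost(histogram, cfg, n_tiles, num_sms):
    """Toy wave model (an assumption): waves needed times time per wave."""
    waves = ceil(num_ctas(histogram, cfg.tile_m, n_tiles) / num_sms)
    return waves * cfg.wave_time_us


def select_config(histogram, configs, n_tiles, num_sms):
    """Pick the configuration with the lowest predicted cost."""
    return min(configs, key=lambda c: predicted_cost(histogram, c, n_tiles, num_sms))


def regret(chosen_time, best_time):
    """Relative slowdown of the chosen configuration versus the best one."""
    return chosen_time / best_time - 1.0


CONFIGS = [
    Config("small", tile_m=16, wave_time_us=10.0),
    Config("large", tile_m=64, wave_time_us=25.0),
]

# Two routing outcomes with the same batch size (128 tokens, 8 experts).
balanced = [16] * 8          # tokens spread evenly across experts
skewed = [128] + [0] * 7     # all tokens routed to one expert

pick_balanced = select_config(balanced, CONFIGS, n_tiles=4, num_sms=8)
pick_skewed = select_config(skewed, CONFIGS, n_tiles=4, num_sms=8)
print(pick_balanced.name, pick_skewed.name)
```

Under these assumed numbers the two histograms have the same batch size but select different configurations: the balanced case favors the small tile and the skewed case favors the large one. A dispatcher keyed on batch size alone cannot make that distinction. The `regret` helper shows how a chosen configuration's measured time would be compared against the best time found by exhaustive search.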

