LG Sep 2, 2025

LExI: Layer-Adaptive Active Experts for Efficient MoE Model Inference

Krishna Teja Chitty-Venkata, Sandeep Madireddy, Murali Emani, Venkatram Vishwanath

arXiv:2509.02753v19.42 citations h-index: 19

Originality: Incremental advance

AI Analysis

This work addresses inference-time compute efficiency for MoE models, which is crucial for deploying large-scale AI systems. It appears incremental because it builds on existing MoE pruning strategies.

The paper tackles inefficient inference in Mixture-of-Experts (MoE) models by introducing LExI, a data-free optimization technique that adaptively assigns the number of active experts per layer. The authors report that this improves inference efficiency over traditional MoE pruning with negligible accuracy loss. For example, Qwen1.5-MoE with LExI reaches the same throughput on an Nvidia H100 GPU with 10% better accuracy than traditional expert pruning.

Mixture-of-Experts (MoE) models scale efficiently by activating only a subset of experts per token, offering a computationally sparse alternative to dense architectures. While prior post-training optimizations, such as inter- and intra-expert pruning, reduce memory usage, they provide limited gains in inference-time compute efficiency. Moreover, existing MoE architectures typically activate a fixed number of experts uniformly across all layers, resulting in redundant computation and suboptimal performance. In this work, we first demonstrate that MoE pruning strategies improve only the memory footprint and do not significantly improve inference performance on GPUs using optimized frameworks such as vLLM. To address this, we introduce LExI, a data-free optimization technique that determines the optimal number of active experts per layer in a pretrained MoE model. LExI leverages only the model weights to estimate the relative importance of each layer and adaptively assigns the number of active experts per layer accordingly. Experiments on state-of-the-art language and vision MoE benchmarks demonstrate that LExI significantly outperforms traditional MoE pruning approaches in terms of inference efficiency with negligible accuracy loss. For example, using LExI, Qwen1.5-MoE achieves the same throughput on an Nvidia H100 GPU with 10% better accuracy than traditional expert pruning.
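To make the idea of a per-layer active-expert count concrete, here is a minimal Python sketch of one way a fixed compute budget could be spread non-uniformly across layers. It is not the authors' algorithm: the abstract does not say how LExI scores layers or searches for the assignment. The function name `allocate_active_experts`, the per-layer `scores` (standing in for a weight-derived importance estimate), and the greedy proportional rule are all assumptions made for illustration.

```python
def allocate_active_experts(scores, total_budget, k_min=1, k_max=8):
    """Distribute `total_budget` active experts across layers.

    Hypothetical illustration, not the LExI procedure: each layer starts
    at `k_min`, then the remaining budget is handed out one expert at a
    time to the layer with the highest score per expert (a greedy,
    roughly proportional rule), never exceeding `k_max`.
    """
    n_layers = len(scores)
    if not (n_layers * k_min <= total_budget <= n_layers * k_max):
        raise ValueError("total_budget is infeasible for the given bounds")

    ks = [k_min] * n_layers
    for _ in range(total_budget - n_layers * k_min):
        # Only layers that still have headroom may receive another expert.
        candidates = [i for i in range(n_layers) if ks[i] < k_max]
        # Favor layers whose importance is high relative to what they hold.
        best = max(candidates, key=lambda i: scores[i] / (ks[i] + 1))
        ks[best] += 1
    return ks


# Assumed importance scores for a 4-layer model (made-up numbers).
layer_scores = [1.0, 3.0, 2.0, 2.0]
# Budget equal to a uniform top-2 baseline: 4 layers * 2 experts.
per_layer_k = allocate_active_experts(layer_scores, total_budget=8, k_max=4)
print(per_layer_k)
```

With these made-up scores the total number of expert activations per token matches a uniform top-2 configuration, but the layer scored as most important receives more experts and the least important layer receives fewer. In a real model the resulting list would replace the single global top-k value used by every MoE layer's router.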
