LG AI Jan 30

MELINOE: Fine-Tuning Enables Memory-Efficient Inference for Mixture-of-Experts Models

arXiv:2602.11192v13.82 citations h-index: 3

Originality: Incremental advance

AI Analysis

The paper addresses the high memory usage of MoE models in resource-constrained settings, offering an incremental improvement over prior expert-offloading methods.

The paper tackles the memory bottleneck in Mixture-of-Experts models by fine-tuning them to activate fewer experts per sequence, which reduces CPU-GPU transfer overhead and increases throughput by 1.2-3x over efficient baselines and up to 14.7x over transfer-heavy baselines while maintaining or improving task performance.

Mixture-of-Experts (MoE) model architectures can significantly reduce the number of activated parameters per token, enabling computationally efficient training and inference. However, their large overall parameter counts and model sizes have precluded their widespread usage in resource-constrained settings as all of the parameters must still be loaded into GPU memory. Prior works aim to address this memory bottleneck by offloading certain experts into CPU memory and porting them to GPU memory only when they are activated. In practice, these methods suffer from the significant I/O latency incurred by expert transfer. We present MELINOE, a method that fine-tunes an MoE model to more strongly prefer activating a smaller number of experts per sequence. Caching these preferred experts in GPU memory reduces expert churn and CPU-GPU transfer overhead. MELINOE increases throughput by $1.2-3\times$ over efficient baselines and up to $14.7\times$ over transfer-heavy baselines while retaining or even improving the performance of the model on a downstream task, making it a reliable method for improving MoE inference efficiency.
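
The caching argument in the abstract can be illustrated with a small simulation. The sketch below is not MELINOE's implementation: it assumes a simple LRU cache of fixed size standing in for GPU memory, one expert per token, and two made-up routing traces. The function name `count_transfers` and both traces are hypothetical. It only shows why a sequence that touches fewer distinct experts incurs fewer CPU-to-GPU loads.

```python
from collections import OrderedDict


def count_transfers(expert_trace, cache_size):
    """Count CPU-to-GPU expert loads for a routing trace under an LRU cache.

    Assumption: the cache models GPU-resident experts, and every miss
    costs exactly one expert transfer from CPU memory.
    """
    cache = OrderedDict()
    transfers = 0
    for expert_id in expert_trace:
        if expert_id in cache:
            # Hit: the expert is already on the GPU; mark it most recently used.
            cache.move_to_end(expert_id)
        else:
            # Miss: the expert must be copied from CPU memory.
            transfers += 1
            cache[expert_id] = True
            if len(cache) > cache_size:
                # Evict the least recently used expert.
                cache.popitem(last=False)
    return transfers


# Hypothetical traces for one 12-token sequence (illustrative data only).
diffuse_trace = [0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3]       # 8 distinct experts
concentrated_trace = [0, 1, 0, 1, 2, 0, 1, 0, 2, 1, 0, 1]  # 3 distinct experts

diffuse_transfers = count_transfers(diffuse_trace, cache_size=4)
concentrated_transfers = count_transfers(concentrated_trace, cache_size=4)
print(diffuse_transfers, concentrated_transfers)
```

With a cache of four experts, the diffuse trace misses on every token, while the concentrated trace pays only for the first load of each of its three experts. Real systems route several experts per token per layer and may use other eviction policies, so the numbers here indicate the direction of the effect rather than its size.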
