LG, CL · Feb 28, 2025

CoSMoEs: Compact Sparse Mixture of Experts

Patrick Huber, Akshat Shrivastava, Ernie Chang, Chinnadhurai Sankar, Ahmed Aly, Adithya Sagar

Meta AI, MILA

arXiv:2503.00245v1 · 1 citation · h-index: 13

Originality: Incremental advance

AI Analysis

This work addresses the challenges of on-device inference for efficient AI models, though it appears incremental in that it adapts existing MoE methods to smaller scales.

The paper shows how to enable Compact Sparse Mixture of Experts (CoSMoEs) for on-device inference along three dimensions: quality, memory, and latency. It reports that MoE architectures outperform FLOP-aligned dense models in a fair evaluation, and it introduces weight-decomposed experts to further improve MoE performance.

Sparse Mixture of Experts (MoE) models are popular foundational architectures at large scale but remain under-explored at smaller sizes. Here, we show how to enable Compact Sparse Mixture of Experts (CoSMoEs) for on-device inference. Specifically, we tackle the three main on-device dimensions: Quality, Memory and Latency. Along the quality axis, we show that in a fair evaluation (removing confounding factors) MoE architectures outperform FLOP-aligned dense models at on-device scale. We introduce weight-decomposed experts, further improving the MoE model performance. Regarding model memory and latency, we significantly improve model offloading efficiency and, in turn, reduce model inference latency.
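
As a rough illustration of what "FLOP-aligned" can mean, the sketch below counts feed-forward parameters for a dense layer and for a sparse MoE layer, separating the parameters stored in memory from those active per token. Everything here is an assumption for illustration and is not taken from the paper: the function names, the bias-free two-matrix feed-forward block, the layer sizes, and in particular the low-rank factorization, which is only one possible reading of "weight-decomposed experts" since the abstract does not specify the decomposition.

```python
def dense_ffn_params(d_model: int, d_hidden: int) -> int:
    """Parameters of a bias-free FFN: one up and one down projection."""
    return 2 * d_model * d_hidden


def moe_ffn_params(d_model: int, d_expert: int, num_experts: int, top_k: int):
    """Return (total, active) parameters for a sparse MoE FFN layer.

    total  = what has to be stored (drives memory and offloading cost)
    active = what a single token touches (drives per-token FLOPs)
    """
    router = d_model * num_experts        # linear router producing expert scores
    per_expert = 2 * d_model * d_expert   # each expert is a small bias-free FFN
    total = router + num_experts * per_expert
    active = router + top_k * per_expert  # only top_k experts run per token
    return total, active


def low_rank_expert_params(d_model: int, d_expert: int, rank: int) -> int:
    """Hypothetical decomposed expert: each projection W is replaced by A @ B,
    with A of shape (in, rank) and B of shape (rank, out)."""
    return 2 * rank * (d_model + d_expert)


if __name__ == "__main__":
    dense = dense_ffn_params(d_model=512, d_hidden=2048)
    total, active = moe_ffn_params(d_model=512, d_expert=1024, num_experts=8, top_k=2)
    print(f"dense FFN params:       {dense:,}")
    print(f"MoE active params:      {active:,}")
    print(f"MoE total params:       {total:,}")
    print(f"low-rank expert params: {low_rank_expert_params(512, 1024, 64):,}")
```

With these illustrative sizes, the MoE layer's active parameter count is close to the dense layer's, so per-token compute is roughly matched, while its total parameter count is about four times larger. That gap between active and total parameters is why memory and offloading efficiency matter at on-device scale.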
