MoLAE: Mixture of Latent Experts for Parameter-Efficient Language Models
This work addresses efficiency bottlenecks in scaling large language models, though it appears incremental because it builds on the existing MoE paradigm.
The paper tackles the high memory and communication overhead of standard Mixture of Experts (MoE) architectures in Large Language Models by introducing Mixture of Latent Experts (MoLAE), which uses a shared projection into a lower-dimensional latent space to reduce parameters and computational requirements while maintaining comparable performance.
Mixture of Experts (MoE) has become a key architectural paradigm for efficiently scaling Large Language Models (LLMs) by selectively activating a subset of parameters for each input token. However, standard MoE architectures face significant challenges, including high memory consumption and communication overhead during distributed training. In this paper, we introduce Mixture of Latent Experts (MoLAE), a novel parameterization that addresses these limitations by reformulating expert operations through a shared projection into a lower-dimensional latent space, followed by expert-specific transformations. This factorized approach substantially reduces parameter count and computational requirements, particularly in existing LLMs where hidden dimensions significantly exceed MoE intermediate dimensions. We provide a rigorous mathematical framework for transforming pre-trained MoE models into the MoLAE architecture, characterize the conditions for optimal factorization, and develop a systematic two-step algorithm for this conversion. Our comprehensive theoretical analysis demonstrates that MoLAE significantly improves efficiency across multiple dimensions while preserving model capabilities. Experimental results confirm that MoLAE achieves comparable performance to standard MoE with substantially reduced resource requirements.
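To make the parameter argument concrete, the following is a minimal sketch of how the count might change under such a factorization. It is not the paper's exact formulation. It assumes each expert projection W_i (intermediate dimension by hidden dimension) is replaced by a shared matrix A (latent dimension by hidden dimension) followed by an expert-specific matrix B_i (intermediate dimension by latent dimension). The function names, the symbols A and B_i, and all sizes below are hypothetical and chosen only for illustration.

```python
def moe_params(num_experts: int, hidden_dim: int, inter_dim: int) -> int:
    """Parameters for one projection per expert in a standard MoE layer.

    Each expert owns a full (inter_dim x hidden_dim) matrix W_i.
    """
    return num_experts * inter_dim * hidden_dim


def molae_params(num_experts: int, hidden_dim: int, inter_dim: int,
                 latent_dim: int) -> int:
    """Parameters under the assumed factorization W_i ~ B_i @ A.

    A (latent_dim x hidden_dim) is shared by all experts;
    each expert keeps only B_i (inter_dim x latent_dim).
    """
    shared = latent_dim * hidden_dim
    per_expert = inter_dim * latent_dim
    return shared + num_experts * per_expert


# Hypothetical sizes (not taken from the paper): the hidden dimension is
# much larger than the MoE intermediate dimension, which is the regime
# the abstract singles out.
num_experts, hidden_dim, inter_dim, latent_dim = 64, 4096, 1024, 1024

dense = moe_params(num_experts, hidden_dim, inter_dim)
factored = molae_params(num_experts, hidden_dim, inter_dim, latent_dim)
ratio = factored / dense

print(f"standard MoE: {dense:,} parameters")
print(f"factorized:   {factored:,} parameters")
print(f"ratio:        {ratio:.4f}")
```

Under these assumptions the ratio is 1/num_experts + latent_dim/hidden_dim, so the saving comes from the latent dimension being small relative to the hidden dimension and from the shared matrix being amortized over many experts. With few experts, or a latent dimension close to the hidden dimension, the factorization saves little.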