LG · CL · Mar 29, 2025

MoLAE: Mixture of Latent Experts for Parameter-Efficient Language Models

arXiv:2503.23100v2 · 5 citations · h-index: 8
Originality: Incremental advance
AI Analysis

This work addresses efficiency bottlenecks in scaling large language models, though it appears incremental because it builds on existing MoE paradigms.

The paper tackles the high memory and communication overhead of standard Mixture of Experts (MoE) architectures in large language models. It introduces Mixture of Latent Experts (MoLAE), which uses a shared projection into a lower-dimensional latent space to reduce parameter count and computational requirements while maintaining comparable performance.

Mixture of Experts (MoE) has become a key architectural paradigm for efficiently scaling Large Language Models (LLMs) by selectively activating a subset of parameters for each input token. However, standard MoE architectures face significant challenges, including high memory consumption and communication overhead during distributed training. In this paper, we introduce Mixture of Latent Experts (MoLAE), a novel parameterization that addresses these limitations by reformulating expert operations through a shared projection into a lower-dimensional latent space, followed by expert-specific transformations. This factorized approach substantially reduces parameter count and computational requirements, particularly in existing LLMs where hidden dimensions significantly exceed MoE intermediate dimensions. We provide a rigorous mathematical framework for transforming pre-trained MoE models into MoLAE architecture, characterizing conditions for optimal factorization, and developing a systematic two-step algorithm for this conversion. Our comprehensive theoretical analysis demonstrates that MoLAE significantly improves efficiency across multiple dimensions while preserving model capabilities. Experimental results confirm that MoLAE achieves comparable performance to standard MoE with substantially reduced resource requirements.
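To make the factorization described in the abstract concrete, here is a minimal sketch of the parameter arithmetic and of a single latent-expert forward pass. It is not the paper's algorithm. It assumes each expert's projection matrix of shape (inter_dim x hidden_dim) is replaced by one shared projection of shape (latent_dim x hidden_dim) plus an expert-specific matrix of shape (inter_dim x latent_dim). The function names, the sizes, and the choice of latent_dim are all hypothetical, and nonlinearities and routing are omitted.

```python
def moe_params(num_experts: int, hidden_dim: int, inter_dim: int) -> int:
    """Weights when every expert owns a full (inter_dim x hidden_dim) matrix."""
    return num_experts * inter_dim * hidden_dim


def latent_params(num_experts: int, hidden_dim: int, inter_dim: int, latent_dim: int) -> int:
    """Weights under the assumed factorization: one shared projection plus per-expert maps."""
    shared = latent_dim * hidden_dim                  # shared (latent_dim x hidden_dim) projection
    per_expert = num_experts * inter_dim * latent_dim  # one (inter_dim x latent_dim) matrix per expert
    return shared + per_expert


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def latent_expert_forward(shared_proj, expert_weight, token):
    """Project the token into the shared latent space, then apply the expert-specific map."""
    latent = matvec(shared_proj, token)
    return matvec(expert_weight, latent)


# Hypothetical sizes, chosen only so that hidden_dim is much larger than inter_dim.
E, d, n, r = 8, 4096, 1024, 512
dense = moe_params(E, d, n)
factored = latent_params(E, d, n, r)
ratio = factored / dense

# Tiny forward pass: 3-dim token, 2-dim latent space, 1-dim expert output.
shared_proj = [[1, 0, 0], [0, 1, 0]]
expert_weight = [[1, 1]]
output = latent_expert_forward(shared_proj, expert_weight, [2, 3, 4])
```

With these illustrative sizes the factored layout stores 6,291,456 weights against 33,554,432 for the dense layout, a ratio of 0.1875. The saving grows with the number of experts because the shared projection is paid for only once, which is consistent with the abstract's remark that the benefit is largest when the hidden dimension significantly exceeds the MoE intermediate dimension.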

