AI · LG · Apr 16

Towards Faster Language Model Inference Using Mixture-of-Experts Flow Matching

arXiv:2604.15009 · 10.9
Predicted impact top 86% in AI · last 90 days · Originality: Highly original
AI Analysis

This work addresses the efficiency bottleneck of generative language models, offering a non-autoregressive approach that dramatically reduces inference time while maintaining quality.

Flow matching for language modeling struggles with complex latent distributions. The proposed mixture-of-experts flow matching (MoE-FM) framework, instantiated as YAN, achieves generation quality on par with autoregressive and diffusion-based models while requiring as few as three sampling steps, yielding a 40× speedup over AR baselines and up to 10³× speedup over diffusion language models.

Flow matching retains the generation quality of diffusion models while enabling substantially faster inference, making it a compelling paradigm for generative modeling. However, when applied to language modeling, it exhibits fundamental limitations in representing complex latent distributions with irregular geometries, such as anisotropy and multimodality. To address these challenges, we propose a mixture-of-experts flow matching (MoE-FM) framework, which captures complex global transport geometries in latent space by decomposing them into locally specialized vector fields. Building on MoE-FM, we develop a non-autoregressive (NAR) language modeling approach, named YAN, instantiated with both Transformer and Mamba architectures. Across multiple downstream tasks, YAN achieves generation quality on par with both autoregressive (AR) and diffusion-based NAR language models, while requiring as few as three sampling steps. This yields a $40\times$ speedup over AR baselines and up to a $10^3\times$ speedup over diffusion language models, demonstrating substantial efficiency advantages for language modeling.
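The two ideas in the abstract, a vector field composed of locally specialized experts and sampling in as few as three steps, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and not YAN's actual architecture: the two expert targets, the distance-based `gate`, and the closed-form straight-line expert velocities are hand-written stand-ins for what would be learned networks, and the latent space is two-dimensional.

```python
import math

# Hypothetical expert targets: each expert transports points toward one mode.
EXPERT_TARGETS = [(-2.0, 0.0), (2.0, 0.0)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gate(x, temperature=0.5):
    """Soft routing weights: experts whose target is nearer get more weight."""
    scores = [-((x[0] - mx) ** 2 + (x[1] - my) ** 2) / temperature
              for mx, my in EXPERT_TARGETS]
    return softmax(scores)


def expert_velocity(x, t, target):
    """Velocity of the straight-line path from x to target, arriving at t = 1."""
    return tuple((m - xi) / (1.0 - t) for xi, m in zip(x, target))


def moe_velocity(x, t):
    """Gate-weighted sum of the expert vector fields."""
    weights = gate(x)
    vx = vy = 0.0
    for w, target in zip(weights, EXPERT_TARGETS):
        ex, ey = expert_velocity(x, t, target)
        vx += w * ex
        vy += w * ey
    return (vx, vy)


def sample(x0, num_steps=3):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x, dt = x0, 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # the last step starts at t = 1 - dt, so 1 - t is never zero
        v = moe_velocity(x, t)
        x = (x[0] + dt * v[0], x[1] + dt * v[1])
    return x
```

Calling `sample((1.0, 0.5))` moves the starting point to the right-hand mode and `sample((-1.0, 0.5))` to the left-hand one. Because each toy expert follows a straight path, three Euler steps are enough here. How well a learned model tolerates so few steps depends on how straight its trajectories are, which this sketch does not demonstrate.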

