CL, AI, LG · Jan 29

Scaling Embeddings Outperforms Scaling Experts in Language Models

arXiv:2601.21204v2 · 9 citations · h-index: 8
Originality: Highly original
AI Analysis

This work addresses scaling bottlenecks in large language models and is aimed at AI researchers and practitioners. Its approach is novel in direction, incremental in scope, and delivers strong gains in specific regimes.

The paper tackles diminishing returns in Mixture-of-Experts (MoE) architectures for large language models by exploring embedding scaling as an alternative route to sparsity scaling. The result is LongCat-Flash-Lite, a 68.5B-parameter model that surpasses parameter-equivalent MoE baselines and is competitive in agentic and coding domains.

While Mixture-of-Experts (MoE) architectures have become the standard for sparsity scaling in large language models, they increasingly face diminishing returns and system-level bottlenecks. In this work, we explore embedding scaling as a potent, orthogonal dimension for scaling sparsity. Through a comprehensive analysis and experiments, we identify specific regimes where embedding scaling achieves a superior Pareto frontier compared to expert scaling. We systematically characterize the critical architectural factors governing this efficacy -- ranging from parameter budgeting to the interplay with model width and depth. Moreover, by integrating tailored system optimizations and speculative decoding, we effectively convert this sparsity into tangible inference speedups. Guided by these insights, we introduce LongCat-Flash-Lite, a 68.5B parameter model with ~3B activated trained from scratch. Despite allocating over 30B parameters to embeddings, LongCat-Flash-Lite not only surpasses parameter-equivalent MoE baselines but also exhibits exceptional competitiveness against existing models of comparable scale, particularly in agentic and coding domains.
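To make the parameter figures in the abstract concrete, the sketch below works out the shares they imply and shows, on a toy table, why a plain embedding lookup reads only a small slice of a large table. The constants come from the abstract ("~3B" is approximate and "over 30B" is a lower bound). The helper `lookup_params_touched` and the toy table size are illustrative assumptions, not the paper's architecture or code.

```python
# Figures quoted in the abstract, in billions of parameters.
TOTAL_PARAMS_B = 68.5
ACTIVATED_PARAMS_B = 3.0    # "~3B activated" (approximate)
EMBEDDING_PARAMS_B = 30.0   # "over 30B", so treat this as a lower bound


def share(part: float, whole: float) -> float:
    """Fraction of `whole` taken up by `part`."""
    return part / whole


def lookup_params_touched(table_rows: int, dim: int, token_ids: list[int]) -> int:
    """Parameters read by a plain embedding lookup for one sequence.

    Hypothetical helper: one row of `dim` values is read per distinct
    token id, regardless of how many rows the table holds.
    """
    distinct = set(token_ids)
    if any(t < 0 or t >= table_rows for t in distinct):
        raise ValueError("token id outside the embedding table")
    return len(distinct) * dim


activated_share = share(ACTIVATED_PARAMS_B, TOTAL_PARAMS_B)
embedding_share = share(EMBEDDING_PARAMS_B, TOTAL_PARAMS_B)

# Toy table (sizes are assumptions chosen for illustration only).
toy_rows, toy_dim = 1_000_000, 64
touched = lookup_params_touched(toy_rows, toy_dim, [5, 17, 5, 99])

print(f"activated share of total: {activated_share:.1%}")
print(f"embedding share of total (lower bound): {embedding_share:.1%}")
print(f"toy lookup reads {touched} of {toy_rows * toy_dim} table parameters")
```

The toy lookup reads the same number of parameters however many rows the table has, which is the sense in which a larger embedding table adds capacity without adding proportional per-token work. How the paper actually structures and budgets its embeddings is not specified in this summary.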
