CL · LG · May 13

Dense vs Sparse Pretraining at Tiny Scale: Active-Parameter vs Total-Parameter Matching

arXiv:2605.13769 · 13.5

Predicted impact: top 71% in CL · last 90 days · Originality: Synthesis-oriented

AI Analysis

For researchers working on small-scale language models, this provides a controlled comparison of MoE and dense architectures, showing that which architecture wins depends on whether parameters are matched by active count or by total count.

The paper compares dense and mixture-of-experts (MoE) transformers at tiny scale (sub-25M parameters) under active-parameter and total-parameter matching. MoE achieves 0.0758 lower validation loss under active-parameter matching, but dense models are 0.0180 better under total-parameter matching.
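
To make the two matching criteria concrete, here is a minimal sketch of how active and total feed-forward parameter counts differ for a top-2, four-expert layer, and how a dense block could be resized to match either budget. The dimensions (D_MODEL, D_FF), the bias-free SwiGLU layout, and the choice to resize only the feed-forward hidden width are illustrative assumptions, not values or procedures taken from the paper.

```python
# Illustrative sketch only: dimensions and the resizing rule are assumptions.

def swiglu_ffn_params(d_model: int, d_ff: int) -> int:
    """Weights in one bias-free SwiGLU feed-forward block (gate, up, down)."""
    return 3 * d_model * d_ff


def moe_ffn_params(d_model: int, d_ff: int, n_experts: int, top_k: int):
    """Return (total, active) parameter counts for one routed MoE block."""
    router = d_model * n_experts           # linear router, no bias
    expert = swiglu_ffn_params(d_model, d_ff)
    total = n_experts * expert + router    # everything stored in memory
    active = top_k * expert + router       # what a single token actually uses
    return total, active


def matched_dense_d_ff(d_model: int, target_params: int) -> int:
    """Hidden width of a dense SwiGLU block closest to a parameter budget."""
    return round(target_params / (3 * d_model))


# Hypothetical tiny-scale dimensions (not from the paper).
D_MODEL, D_FF, N_EXPERTS, TOP_K = 256, 512, 4, 2

total, active = moe_ffn_params(D_MODEL, D_FF, N_EXPERTS, TOP_K)
d_ff_active = matched_dense_d_ff(D_MODEL, active)  # dense active-match width
d_ff_total = matched_dense_d_ff(D_MODEL, total)    # dense total-match width

print(f"MoE block: total={total}, active={active}")
print(f"dense active-match d_ff={d_ff_active}, total-match d_ff={d_ff_total}")
```

Under these assumed dimensions the total-match dense block is roughly twice as wide as the active-match one, which is why the two baselines can land on opposite sides of the MoE.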

We study dense and mixture-of-experts (MoE) transformers in a tiny-scale pretraining regime under a shared LLaMA-style decoder training recipe. The sparse model replaces dense feed-forward blocks with Mixtral-style routed experts. Dense baselines are modestly width-resized to tightly match either active or total parameter budgets, while tokenizer, data, optimizer, schedule, depth, context length, normalization style, and evaluation protocol are held fixed. Our best sparse recipe uses four experts, top-2 routing, Switch-style load balancing, and router z-loss. In a three-seed full-data comparison, the dense active-match model reaches 1.6545 +/- 0.0012 best validation loss, the MoE reaches 1.5788 +/- 0.0020, and the dense total-match model reaches 1.5608 +/- 0.0025. This yields a matched-active gap of 0.0758 +/- 0.0021 in the MoE's favor and a matched-total gap of 0.0180 +/- 0.0020 in the dense model's favor. Across training, the matched-active advantage grows while the matched-total dense advantage narrows sharply. In this sub-25M-parameter regime, MoE therefore improves validation loss under active-parameter matching but does not surpass dense training at equal total stored capacity.
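
The two router regularizers named in the recipe can be sketched in a few lines of plain Python. This is a hedged illustration of a Switch-style load-balancing term and a router z-loss in their commonly used forms, not the paper's implementation: the function names are hypothetical, the top-2 dispatch fraction is normalized to sum to 1 (one of several conventions), and the loss coefficients are omitted because the abstract does not state them.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def logsumexp(logits):
    """Numerically stable log-sum-exp of one token's router logits."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def top_k_indices(values, k):
    """Indices of the k largest values (ties broken by lower index)."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]


def router_aux_losses(router_logits, top_k=2):
    """Return (load_balance, z_loss) for logits shaped [tokens][experts].

    load_balance = n_experts * sum_e f_e * P_e, where f_e is the fraction of
    routing slots sent to expert e and P_e is its mean router probability.
    z_loss = mean over tokens of logsumexp(logits) ** 2.
    """
    n_tokens = len(router_logits)
    n_experts = len(router_logits[0])
    dispatch_frac = [0.0] * n_experts
    mean_prob = [0.0] * n_experts
    z_loss = 0.0
    for logits in router_logits:
        probs = softmax(logits)
        for e in top_k_indices(probs, top_k):
            dispatch_frac[e] += 1.0 / (n_tokens * top_k)
        for e, p in enumerate(probs):
            mean_prob[e] += p / n_tokens
        z_loss += logsumexp(logits) ** 2 / n_tokens
    balance = n_experts * sum(f * p for f, p in zip(dispatch_frac, mean_prob))
    return balance, z_loss


# Four experts, eight tokens: an indifferent router vs. a collapsed one.
uniform = [[0.0, 0.0, 0.0, 0.0] for _ in range(8)]
collapsed = [[10.0, 10.0, -10.0, -10.0] for _ in range(8)]

uniform_balance, uniform_z = router_aux_losses(uniform)
collapsed_balance, collapsed_z = router_aux_losses(collapsed)
print(uniform_balance, uniform_z)
print(collapsed_balance, collapsed_z)
```

In this form the load-balancing term equals 1 when router probabilities are uniform and grows as probability mass and dispatch concentrate on the same experts, while the z-loss penalizes large logit magnitudes. In training, each term would be scaled by a small coefficient and added to the language-modeling loss.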
