Robustness of Mixtures of Experts to Feature Noise
The work offers insight into the robustness of MoE models for machine learning practitioners dealing with noisy data, though its contribution is incremental, since it mainly explains performance gains that have already been observed.
The paper asks why Mixture of Experts (MoE) models outperform dense networks beyond parameter scaling. It shows that sparse expert activation acts as a noise filter, yielding lower generalization error, improved robustness to perturbations, and faster convergence under feature noise.
Despite their practical success, it remains unclear why Mixture of Experts (MoE) models can outperform dense networks beyond sheer parameter scaling. We study an iso-parameter regime where inputs exhibit latent modular structure but are corrupted by feature noise, a proxy for noisy internal activations. We show that sparse expert activation acts as a noise filter: compared to a dense estimator, MoEs achieve lower generalization error under feature noise, improved robustness to perturbations, and faster convergence. Empirical results on synthetic data and real-world language tasks corroborate the theoretical insights, demonstrating consistent robustness and efficiency gains from sparse modular computation.
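
The noise-filter intuition can be illustrated with a small NumPy sketch. This is a toy setup and not the paper's construction: the block-structured inputs, the fixed linear readout `W`, the oracle router (the sparse model is handed the true active block `z`), and the function name `noise_filter_demo` are all assumptions made for illustration. Both estimators share the same weights, so the comparison isolates how much feature noise each one lets through.

```python
import numpy as np

def noise_filter_demo(k=4, d=8, n=5000, sigma=0.5, seed=0):
    """Toy comparison of a dense and a sparsely routed linear readout.

    Assumptions (hypothetical, for illustration only):
    - each input has k blocks of d features, and exactly one block is active;
    - the target is a linear function of the active block;
    - i.i.d. Gaussian feature noise with std sigma hits every block;
    - the sparse model uses an oracle router that knows the active block.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(k, d))              # one weight vector per block
    z = rng.integers(0, k, size=n)           # latent active block per sample
    rows = np.arange(n)

    # Clean inputs: only the active block carries signal.
    X_clean = np.zeros((n, k, d))
    X_clean[rows, z] = rng.normal(size=(n, d))
    y = np.einsum("nkd,kd->n", X_clean, W)   # noise-free targets

    # Feature noise corrupts all blocks, active or not.
    X_noisy = X_clean + sigma * rng.normal(size=(n, k, d))

    # Dense readout sums over every block, so it also sums their noise.
    dense_pred = np.einsum("nkd,kd->n", X_noisy, W)
    # Sparse readout evaluates only the routed block.
    sparse_pred = np.einsum("nd,nd->n", X_noisy[rows, z], W[z])

    mse_dense = float(np.mean((dense_pred - y) ** 2))
    mse_sparse = float(np.mean((sparse_pred - y) ** 2))
    return mse_dense, mse_sparse

if __name__ == "__main__":
    mse_dense, mse_sparse = noise_filter_demo()
    print(f"dense MSE:  {mse_dense:.3f}")
    print(f"sparse MSE: {mse_sparse:.3f}")
```

Under these assumptions the dense readout accumulates noise from all `k` blocks while the sparse readout sees noise from the active block only, so its expected squared error is smaller by a factor of `k`. A learned router and trained experts, as in a real MoE, would add routing and estimation error that this sketch deliberately leaves out.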