LG · Oct 17, 2024

Sparse Mixture-of-Experts for Compositional Generalization: Empirical Evidence and Theoretical Foundations of Optimal Sparsity

Jinze Zhao, Peihao Wang, Junjie Yang, Ruisi Cai, Gaowen Liu, Jayanth Srinivasa, Ramana Rao Kompella, Yingbin Liang, Zhangyang Wang

arXiv:2410.13964v2 · 6.43 citations · h-index: 33

Originality: Highly original

AI Analysis

The paper addresses the challenge of designing computationally efficient SMoE models that still generalize robustly to novel compositions. The contribution is incremental in that it builds on existing SMoE architectures.

The paper studies compositional generalization in Sparse Mixture-of-Experts (SMoE) models and argues that the optimal sparsity is not minimal activation but scales with task complexity. Empirical results on the SRAVEN and SKILL-MIX benchmarks show that maintaining performance requires activating more experts as task difficulty increases.

Sparse Mixture-of-Experts (SMoE) architectures have gained prominence for their ability to scale neural networks, particularly transformers, without a proportional increase in computational cost. Despite their success, their role in compositional generalization, i.e., adapting to novel combinations of known components, remains under-explored. This study challenges the assumption that minimal expert activation suffices for task generalization and investigates the relationship between task complexity and optimal sparsity in SMoE models. Through empirical evaluations on the SRAVEN symbolic reasoning task and the SKILL-MIX benchmark, we demonstrate that (i) the number of activated experts consistently increases with the perceived task difficulty to maintain performance; and (ii) the optimal number of activated experts scales proportionally with task complexity. Our theoretical analysis derives a scaling law for optimal sparsity by balancing approximation and estimation errors, revealing alignment with empirical observations. We formally show that the optimal sparsity lies between minimal activation (1-2 experts) and full activation, with the exact number scaling proportionally to task complexity and further influenced by the size of the training data and the complexity of the model. These findings offer practical insights for designing SMoE models that achieve computational efficiency while enabling robust compositional generalization.
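To make "number of activated experts" concrete, the following is a minimal sketch of top-k routing in a sparse mixture-of-experts layer. It is an illustration under stated assumptions, not the paper's implementation: the function names `top_k_gate` and `smoe_forward`, the toy scalar experts, and the fixed router logits are all hypothetical, and a real SMoE layer would use learned routers and feed-forward experts over vectors.

```python
import math

def top_k_gate(logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest router logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to the selected experts (subtract max for stability).
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def smoe_forward(x, experts, router_logits, k):
    """Weighted sum of the outputs of the k activated experts."""
    gates = top_k_gate(router_logits, k)
    return sum(w * experts[i](x) for i, w in gates.items())

# Toy experts (assumption): scalar functions standing in for feed-forward blocks.
experts = [lambda x: x, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
# Fixed logits (assumption): in practice these come from a learned router.
router_logits = [0.1, 2.0, 1.5, -1.0]

for k in (1, 2, 4):
    active = sorted(top_k_gate(router_logits, k))
    print(k, active, smoe_forward(3.0, experts, router_logits, k))
```

Here `k` is the sparsity knob the abstract refers to: `k = 1` or `k = 2` corresponds to minimal activation, and `k = len(experts)` to full activation. Only the selected experts are evaluated, which is why compute grows with `k` rather than with the total number of experts.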
