On the Spatial Structure of Mixture-of-Experts in Transformers
This work addresses a fundamental aspect of MoE-based architectures, namely how routers select experts, and could inform their design and efficiency, though it appears incremental in scope.
The paper challenges the assumption that Mixture-of-Experts routers rely primarily on semantic features, showing that positional token information also plays a crucial role in routing decisions. The claim is supported by empirical analysis and a phenomenological explanation.
A common assumption is that MoE routers primarily leverage semantic features for expert selection. However, our study challenges this notion by demonstrating that positional token information also plays a crucial role in routing decisions. Through extensive empirical analysis, we provide evidence supporting this hypothesis, develop a phenomenological explanation of the observed behavior, and discuss practical implications for MoE-based architectures.
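One way to probe the kind of position dependence described above is to estimate the mutual information between a token's position and the expert it is routed to. The sketch below is an illustrative diagnostic only, not the analysis used in the paper. The function name `position_expert_mutual_information` and the toy routing logs are assumptions made up for this example.

```python
import math
from collections import Counter


def position_expert_mutual_information(positions, experts):
    """Plug-in estimate of I(position; expert) in bits.

    `positions` and `experts` are equal-length sequences of hashable
    labels, one pair per routed token (hypothetical routing log).
    """
    if len(positions) != len(experts):
        raise ValueError("positions and experts must have the same length")
    n = len(positions)
    if n == 0:
        raise ValueError("need at least one observation")

    joint = Counter(zip(positions, experts))
    pos_counts = Counter(positions)
    exp_counts = Counter(experts)

    mi = 0.0
    for (p, e), c in joint.items():
        # p(p, e) * log2( p(p, e) / (p(p) * p(e)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (pos_counts[p] * exp_counts[e]))
    return mi


# Toy data (made up): two sequences of four tokens each.
positions = [0, 1, 2, 3, 0, 1, 2, 3]

# Router that depends only on position parity.
position_driven = [0, 1, 0, 1, 0, 1, 0, 1]

# Router whose choice is independent of position in this sample.
position_blind = [0, 0, 0, 0, 1, 1, 1, 1]

mi_driven = position_expert_mutual_information(positions, position_driven)
mi_blind = position_expert_mutual_information(positions, position_blind)
print(f"position-driven router: {mi_driven:.3f} bits")
print(f"position-blind router:  {mi_blind:.3f} bits")
```

On real routing logs the plug-in estimator is biased upward for small samples, so a value would normally be compared against a baseline obtained by shuffling the expert labels across positions.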