AI · Apr 15

Equifinality in Mixture of Experts: Routing Topology Does Not Determine Language Modeling Quality

arXiv:2604.14419 · 26.42 citations · h-index: 1

AI Analysis

For researchers building MoE language models, this work challenges the prevailing assumption that sophisticated routing mechanisms are crucial for quality, showing that simpler routing can achieve comparable performance.

The paper investigates whether routing topology determines language modeling quality in sparse Mixture-of-Experts (MoE) architectures. Through 62 controlled experiments, the authors find that routing topology does not determine asymptotic perplexity: five cosine-routing variants are statistically equivalent within a 1-PPL margin, and a standard linear router's advantage over iso-parameter cosine routing is only ~1.2%.

Sparse Mixture-of-Experts (MoE) architectures employ increasingly sophisticated routing mechanisms -- learned routers, multi-hop trajectories, token-dependent gating. We ask: does routing topology actually determine language modeling quality? We build a geometric MoE (ST-MoE) using cosine-similarity routing against learned centroids in a low-dimensional space ($d_{space} = 64$), requiring 80% fewer routing parameters than standard linear routers. Through 62 controlled experiments on WikiText-103 at 76--84M parameters trained to convergence (50K steps, 1.64B tokens), we find that routing topology does not determine asymptotic perplexity (PPL): five cosine-routing variants are statistically equivalent within a 1-PPL margin (Two One-Sided Tests [TOST], $p < 0.05$ for all 10 pairwise comparisons; 15 runs across 3 seeds, observed range 33.93--34.72). The finding extends to hash, random-fixed, and top-1 routing (single-seed; graceful 1.1--2.2 PPL degradation) and replicates on OpenWebText (0.03 PPL gap, 6 runs, 3 seeds each). A standard linear router with 5.3$\times$ more routing parameters reaches PPL 32.76, but iso-parameter cosine routing closes 67% of this gap -- the true mechanism advantage is $\sim$1.2%. The mechanistic explanation is convergent redundancy: multi-hop updates are collinear ($\cos(\Delta h_0, \Delta h_1) = 0.805$), implementing magnitude amplification rather than compositional reasoning; a single learnable scalar replicates multi-hop performance. As a practical payoff, zero-shot relative-norm halting saves 25% of MoE FLOPs at +0.12% PPL. Expert-level specialization and causal controllability -- which coexist with topology-level equifinality -- are explored in a companion paper.
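
The cosine-similarity routing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: only the idea of scoring tokens against learned centroids in a $d_{space} = 64$ space comes from the abstract. The linear projection `W`, the top-k selection, the softmax gating, and the sizes `d_model = 256` and `n_experts = 8` are assumptions made for illustration, and the random values stand in for learned parameters.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def project(h, W):
    """Map a hidden state h (length d_model) into routing space using the rows of W."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def cosine_route(h, W, centroids, top_k=2):
    """Score experts by cosine similarity to their centroids and keep the top_k.

    Top-k selection and softmax gating are assumptions, not taken from the paper.
    """
    z = l2_normalize(project(h, W))
    sims = [sum(a * b for a, b in zip(z, l2_normalize(c))) for c in centroids]
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:top_k]
    # Softmax over the selected similarities gives gate weights that sum to 1.
    m = max(sims[i] for i in top)
    exps = [math.exp(sims[i] - m) for i in top]
    total = sum(exps)
    gates = [e / total for e in exps]
    return top, gates, sims


# Hypothetical sizes: only d_space = 64 is taken from the abstract.
rng = random.Random(0)
d_model, d_space, n_experts = 256, 64, 8
W = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_space)]
centroids = [[rng.gauss(0, 1) for _ in range(d_space)] for _ in range(n_experts)]
h = [rng.gauss(0, 1) for _ in range(d_model)]

experts, gates, sims = cosine_route(h, W, centroids, top_k=2)
print("selected experts:", experts)
print("gate weights:", gates)
```

In this sketch the router's only parameters are the projection and the centroids. A standard linear router would instead learn a weight matrix that maps the hidden state directly to per-expert logits.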
