Routing by Analogy: kNN-Augmented Expert Assignment for Mixture-of-Experts
This paper addresses robust routing in Mixture-of-Experts large language models. It is aimed at AI researchers and practitioners and offers an incremental improvement over existing methods.
The paper tackles the brittleness of frozen routers in Mixture-of-Experts architectures under distribution shift. It introduces kNN-MoE, a retrieval-augmented routing framework that reuses optimal expert assignments from similar past cases. In the reported experiments, kNN-MoE outperforms zero-shot baselines and rivals supervised fine-tuning.
Mixture-of-Experts (MoE) architectures scale large language models efficiently by employing a parametric "router" to dispatch tokens to a sparse subset of experts. Typically, this router is trained once and then frozen, rendering routing decisions brittle under distribution shifts. We address this limitation by introducing kNN-MoE, a retrieval-augmented routing framework that reuses optimal expert assignments from a memory of similar past cases. This memory is constructed offline by directly optimizing token-wise routing logits to maximize the likelihood on a reference set. Crucially, we use the aggregate similarity of retrieved neighbors as a confidence-driven mixing coefficient, thus allowing the method to fall back to the frozen router when no relevant cases are found. Experiments show kNN-MoE outperforms zero-shot baselines and rivals computationally expensive supervised fine-tuning.
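The confidence-driven fallback described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the abstract does not give the exact formulas, so the use of cosine similarity, the choice of the mean clipped top-k similarity as the mixing coefficient, the similarity-weighted average of stored assignments, and the mixing in probability space are all assumptions. The function name `knn_mixed_routing` and its parameters are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array of logits."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()


def knn_mixed_routing(query, router_logits, memory_keys, memory_logits, k=2):
    """Blend frozen-router probabilities with retrieved routing assignments.

    query:         (d,) hidden state of the current token
    router_logits: (E,) logits from the frozen parametric router
    memory_keys:   (N, d) hidden states stored in the memory
    memory_logits: (N, E) routing logits stored for each memory entry
    Returns (mixed_probs, lam), where lam is the mixing coefficient.
    """
    # Assumption: cosine similarity is the retrieval metric.
    q = query / np.linalg.norm(query)
    keys = memory_keys / np.linalg.norm(memory_keys, axis=1, keepdims=True)
    sims = keys @ q

    # Indices of the k most similar memory entries.
    top = np.argsort(-sims)[:k]
    top_sims = np.clip(sims[top], 0.0, 1.0)

    # Assumption: the "aggregate similarity" is the mean clipped similarity,
    # so lam lies in [0, 1] and is 0 when nothing relevant is retrieved.
    lam = float(top_sims.mean())

    router_probs = softmax(router_logits)
    if top_sims.sum() > 0:
        # Similarity-weighted average of the stored expert distributions.
        weights = top_sims / top_sims.sum()
        retrieved = sum(w * softmax(memory_logits[i]) for w, i in zip(weights, top))
    else:
        # No relevant neighbors: the retrieved term is irrelevant since lam == 0.
        retrieved = router_probs

    mixed = lam * retrieved + (1.0 - lam) * router_probs
    return mixed, lam
```

Under these assumptions, a query identical to a stored key yields a coefficient of 1 and the stored assignment is used as is, while a query orthogonal to every key yields a coefficient of 0 and the output reduces to the frozen router's distribution.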