LG AI | Jun 19, 2025

Optimizing MoE Routers: Design, Implementation, and Evaluation in Transformer Models

Daniel Fidel Harvey, George Weale, Berk Yilmaz

arXiv:2506.16419v1 | 2 citations

Originality: Incremental advance

AI Analysis

This work provides a comparative analysis for optimizing MoE router performance in large-scale language models, but it is incremental as it builds on existing router designs without introducing a paradigm shift.

This project addresses suboptimal routing in Mixture of Experts (MoE) architectures, which can cause load imbalance and reduced accuracy. The authors designed and implemented six router variants, including a new MLP-Hadamard router, and evaluated them on BERT and Qwen1.5-MoE. The results show trade-offs between speed and expressiveness, with the MLP-Hadamard router producing structured, sparse routing.

Mixture of Experts (MoE) architectures increase large language model scalability, yet their performance depends on the router module that dispatches tokens to specialized experts. Poor routing can cause load imbalance and reduced accuracy. This project designed and implemented different router architectures within Transformer models to address these limitations. We experimented with six distinct router variants: Linear, Attention, Multi-Layer Perceptron (MLP), Hybrid, Hash, and our new MLP-Hadamard. We characterized these routers using BERT and the Qwen1.5-MoE model, looking at parameter efficiency, inference latency, routing entropy, and expert utilization patterns. Our evaluations showed distinct trade-offs: Linear routers offer speed, while MLP and Attention routers provide greater expressiveness. The MLP-Hadamard router shows a unique capability for structured, sparse routing. We successfully replaced and fine-tuned custom routers within the complex, quantized Qwen1.5-MoE model. This work provides a comparative analysis of MoE router designs and offers insights into optimizing their performance for efficient and effective large-scale model deployment.
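To make the evaluation terms above concrete, here is a minimal sketch of a linear router with top-k selection, together with two of the metrics the abstract names: routing entropy and expert utilization. It is not the authors' implementation. The function names, the toy sizes, and the specific metric definitions (Shannon entropy of the per-token routing distribution, and each expert's share of routed slots) are assumptions chosen for illustration, since the abstract does not spell them out.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_router(token, weights):
    """Score each expert with one dot product per expert (a linear router)."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in weights]
    return softmax(logits)


def top_k(probs, k):
    """Indices of the k highest-probability experts, best first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


def routing_entropy(probs):
    """Shannon entropy (in nats) of one token's routing distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def expert_utilization(assignments, num_experts):
    """Fraction of all routed slots that each expert receives."""
    counts = [0] * num_experts
    for chosen in assignments:
        for expert in chosen:
            counts[expert] += 1
    total = sum(counts)
    return [c / total for c in counts]


if __name__ == "__main__":
    random.seed(0)
    hidden, num_experts, k = 8, 4, 2  # toy sizes, not the paper's settings
    weights = [[random.gauss(0, 1) for _ in range(hidden)]
               for _ in range(num_experts)]
    tokens = [[random.gauss(0, 1) for _ in range(hidden)] for _ in range(16)]

    probs = [linear_router(t, weights) for t in tokens]
    assignments = [top_k(p, k) for p in probs]
    mean_entropy = sum(routing_entropy(p) for p in probs) / len(probs)

    print("mean routing entropy (nats):", round(mean_entropy, 3))
    print("expert utilization:", expert_utilization(assignments, num_experts))
```

Under these definitions, a uniform routing distribution over N experts has entropy log N, and a one-hot distribution has entropy 0. Low entropy together with even utilization would indicate confident but balanced routing, while utilization concentrated on a few experts corresponds to the load imbalance the abstract describes.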
