LG, Sep 3, 2025

Mixture-of-Clustered-Experts: Advancing Expert Specialization and Generalization in Instruction Tuning

arXiv:2509.10513v1, 3 citations, h-index: 13, EMNLP
Originality: Incremental advance
AI Analysis

This paper addresses input heterogeneity in instruction tuning, offering an incremental improvement over existing MoE methods.

The paper aims to improve expert specialization and generalization in sparse Mixture-of-Experts (MoE) architectures for instruction tuning. It proposes Mixture-of-Clustered-Experts (MoCE), whose dual-stage routing mechanism partitions heterogeneous inputs, and it reports consistent gains over strong baselines across benchmarks.

A sparse Mixture-of-Experts (MoE) architecture has emerged as a highly scalable solution by conditionally activating sub-modules without a proportional increase in computational costs. However, improving expert specialization to enhance performance and generalization remains a challenge for MoE, especially in instruction tuning scenarios characterized by significant input heterogeneity. In this work, we propose the Mixture-of-Clustered-Experts (MoCE) to address this limitation through a dual-stage routing mechanism. The first stage in the mechanism performs expert group routing based on sequence-level features, while the second stage activates the top-$k$ experts within the group at the token level. This approach enables the effective partitioning of heterogeneous inputs based on their knowledge requirements, encouraging expert group specialization while maintaining the advantages of token-level routing. We evaluate MoCE across a comprehensive set of benchmarks, demonstrating its consistent superiority over strong baselines and its enhanced generalization capabilities. Detailed analysis further highlights the robustness and effectiveness of MoCE.
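
The dual-stage routing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how sequence-level features are computed or how groups are scored, so the mean-pooled sequence feature, the dot-product scoring, and the names `route_sequence`, `group_weights`, and `expert_weights` are all assumptions made for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_sequence(tokens, group_weights, expert_weights, k):
    """Two-stage routing sketch (hypothetical helper, not from the paper).

    tokens:         list of token vectors (lists of floats)
    group_weights:  one scoring vector per expert group
    expert_weights: expert_weights[g][e] scores expert e of group g
    k:              experts activated per token inside the chosen group
    """
    dim = len(tokens[0])

    # Stage 1: pick one expert group from a sequence-level feature.
    # Assumption: the feature is the mean of the token vectors.
    seq_feature = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    group_scores = [dot(seq_feature, w) for w in group_weights]
    group = max(range(len(group_scores)), key=group_scores.__getitem__)

    # Stage 2: per-token top-k routing restricted to the chosen group.
    assignments = []
    for t in tokens:
        scores = [dot(t, w) for w in expert_weights[group]]
        top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
        gates = softmax([scores[e] for e in top])  # renormalize over the top-k
        assignments.append(list(zip(top, gates)))
    return group, assignments


# Toy inputs: 2 tokens of dimension 2, 2 groups, 3 experts per group.
tokens = [[1.0, 0.0], [0.8, 0.2]]
group_weights = [[1.0, 0.0], [0.0, 1.0]]
expert_weights = [
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],  # experts of group 0
    [[0.0, 1.0], [1.0, 1.0], [0.2, 0.8]],  # experts of group 1
]

group, assignments = route_sequence(tokens, group_weights, expert_weights, k=2)
print("chosen group:", group)
for i, routed in enumerate(assignments):
    print("token", i, "->", routed)
```

In this sketch every token of a sequence shares the same expert group, while the experts chosen within that group still vary per token. That is the combination of group specialization and token-level routing the abstract describes.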

