CL AI LG Apr 23, 2024

Multi-Head Mixture-of-Experts

arXiv:2404.15045v1, 37 citations, h-index: 41, NIPS
Originality: Incremental advance
AI Analysis

This paper addresses efficiency and capability limitations in large-scale models for NLP and multi-modal applications, but it is an incremental advance because it builds on existing SMoE methods.

The paper tackles issues in Sparse Mixtures of Experts (SMoE) such as low expert activation and lack of fine-grained analysis by proposing Multi-Head Mixture-of-Experts (MH-MoE), which splits tokens into sub-tokens for parallel expert processing, showing effectiveness across language and multi-modality modeling tasks.

Sparse Mixtures of Experts (SMoE) scales model capacity without significant increases in training and inference costs, but it exhibits two issues: (1) low expert activation, where only a small subset of experts is activated for optimization, and (2) a lack of fine-grained analytical capabilities for multiple semantic concepts within individual tokens. We propose Multi-Head Mixture-of-Experts (MH-MoE), which employs a multi-head mechanism to split each token into multiple sub-tokens. These sub-tokens are then assigned to and processed by a diverse set of experts in parallel, and seamlessly reintegrated into the original token form. The multi-head mechanism enables the model to collectively attend to information from various representation spaces within different experts while significantly enhancing expert activation, thereby deepening context understanding and alleviating overfitting. Moreover, MH-MoE is straightforward to implement and is decoupled from other SMoE optimization methods, making it easy to integrate with other SMoE models for enhanced performance. Extensive experimental results across three tasks (English-focused language modeling, multi-lingual language modeling, and masked multi-modality modeling) demonstrate the effectiveness of MH-MoE.
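The split, route, and merge flow described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names (`split_heads`, `route_top1`, `make_expert`, `mh_moe_layer`) are hypothetical, top-1 routing by dot product is an assumption, and each expert is a toy elementwise affine map standing in for a feed-forward network. Any learned projections around the split and merge, as well as load-balancing losses, are left out.

```python
def split_heads(token, num_heads):
    """Split one token vector into num_heads equal-length sub-tokens."""
    if len(token) % num_heads != 0:
        raise ValueError("token length must be divisible by num_heads")
    size = len(token) // num_heads
    return [token[i * size:(i + 1) * size] for i in range(num_heads)]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def route_top1(sub_token, router_weights):
    """Return the index of the expert with the highest routing score.

    Assumption: one router vector per expert, scored by dot product.
    Ties go to the lowest expert index.
    """
    scores = [dot(sub_token, w) for w in router_weights]
    return max(range(len(scores)), key=scores.__getitem__)


def make_expert(scale, bias):
    """Toy expert (hypothetical): elementwise affine map, not a real FFN."""
    return lambda v: [scale * x + bias for x in v]


def mh_moe_layer(tokens, num_heads, router_weights, experts):
    """Split each token, send each sub-token to one expert, then merge.

    Returns the output tokens (same shape as the input) and the set of
    expert indices that received at least one sub-token.
    """
    outputs, used = [], set()
    for token in tokens:
        merged = []
        for sub in split_heads(token, num_heads):
            idx = route_top1(sub, router_weights)
            used.add(idx)
            # Concatenating expert outputs restores the original token length.
            merged.extend(experts[idx](sub))
        outputs.append(merged)
    return outputs, used
```

Because routing happens per sub-token rather than per token, a single token can reach several experts in one layer. In this sketch that is what raises the number of activated experts; the sketch does not measure how large the effect is in a trained model.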

Code Implementations: 1 repo