CL | Oct 5, 2021

MoEfication: Transformer Feed-forward Layers are Mixtures of Experts

arXiv:2110.01786v3 | 671 citations | Has Code
Originality: Incremental advance
AI Analysis

This work addresses efficiency and interpretability issues in large language models for researchers and practitioners, though it is incremental as it builds on known sparsity patterns.

The paper addresses the unclear computational patterns of feed-forward networks (FFNs) in Transformers by proposing MoEfication, which converts a model into a Mixture-of-Experts (MoE) version with the same parameters. The converted models retain over 95% of the original performance while using only 10% to 30% of FFN parameters, and reach a 2x inference speedup when 25% of FFN parameters are used.

Recent work has shown that feed-forward networks (FFNs) in pre-trained Transformers are a key component, storing various linguistic and factual knowledge. However, the computational patterns of FFNs are still unclear. In this work, we study the computational patterns of FFNs and observe that most inputs only activate a tiny ratio of neurons of FFNs. This phenomenon is similar to the sparsity of the human brain, which drives research on functional partitions of the human brain. To verify whether functional partitions also emerge in FFNs, we propose to convert a model into its MoE version with the same parameters, namely MoEfication. Specifically, MoEfication consists of two phases: (1) splitting the parameters of FFNs into multiple functional partitions as experts, and (2) building expert routers to decide which experts will be used for each input. Experimental results show that MoEfication can conditionally use 10% to 30% of FFN parameters while maintaining over 95% original performance for different models on various downstream tasks. Besides, MoEfication brings two advantages: (1) it significantly reduces the FLOPS of inference, i.e., 2x speedup with 25% of FFN parameters, and (2) it provides a fine-grained perspective to study the inner mechanism of FFNs. The source code of this paper can be obtained from https://github.com/thunlp/MoEfication.
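The two phases described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the contiguous neuron partition and the mean-weight router below are simple stand-ins chosen for brevity, and every function name (`split_into_experts`, `route`, `moe_ffn`) is hypothetical. See the linked repository for the actual method.

```python
import numpy as np

def dense_ffn(x, W1, b1, W2, b2):
    """Standard ReLU feed-forward layer applied to one input vector x."""
    h = np.maximum(x @ W1 + b1, 0.0)
    return h @ W2 + b2

def split_into_experts(d_ff, num_experts):
    """Phase 1 (assumption): partition hidden neurons into contiguous groups.
    This is a placeholder for a real functional partitioning."""
    return np.array_split(np.arange(d_ff), num_experts)

def route(x, W1, experts, k):
    """Phase 2 (assumption): score each expert by the dot product of x with
    the mean input weight of its neurons, then keep the k best experts."""
    scores = np.array([x @ W1[:, idx].mean(axis=1) for idx in experts])
    return np.argsort(scores)[-k:]

def moe_ffn(x, W1, b1, W2, b2, experts, k):
    """Evaluate only the hidden neurons that belong to the selected experts."""
    chosen = route(x, W1, experts, k)
    idx = np.concatenate([experts[e] for e in chosen])
    h = np.maximum(x @ W1[:, idx] + b1[idx], 0.0)
    return h @ W2[idx, :] + b2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model, d_ff, num_experts = 8, 32, 4
    W1, b1 = rng.normal(size=(d_model, d_ff)), rng.normal(size=d_ff)
    W2, b2 = rng.normal(size=(d_ff, d_model)), rng.normal(size=d_model)
    x = rng.normal(size=d_model)

    experts = split_into_experts(d_ff, num_experts)
    full = moe_ffn(x, W1, b1, W2, b2, experts, k=num_experts)
    sparse = moe_ffn(x, W1, b1, W2, b2, experts, k=1)  # 25% of FFN neurons
    print("all experts match dense FFN:",
          np.allclose(full, dense_ffn(x, W1, b1, W2, b2)))
    print("approximation error with 1 of 4 experts:",
          np.linalg.norm(sparse - dense_ffn(x, W1, b1, W2, b2)))
```

Selecting every expert reproduces the dense layer exactly, because the experts only regroup existing parameters. With random weights, as here, the sparse output is a poor approximation; the abstract's claim rests on the observation that in pre-trained models most inputs activate only a small fraction of neurons.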

Code Implementations: 1 repo
