LG, Oct 6, 2023

Exploiting Activation Sparsity with Dense to Dynamic-k Mixture-of-Experts Conversion

arXiv:2310.04361v4, 14 citations, h-index: 5
Originality: Incremental advance
AI Analysis

This work addresses inference efficiency for users of large Transformer models. It offers a significant reduction in computation, though the contribution is incremental because it builds on existing MoE conversion methods.

The paper tackles the high computational cost of Transformer models by converting dense layers into dynamic-k Mixture-of-Experts layers. It leverages activation sparsity and introduces a per-token expert selection rule, reducing inference cost by up to 60% with minimal performance impact on NLP and vision tasks.

Transformer models can face practical limitations due to their high computational requirements. At the same time, such models exhibit significant activation sparsity, which can be leveraged to reduce the inference cost by converting parts of the network into equivalent Mixture-of-Experts (MoE) layers. Despite the crucial role played by activation sparsity, its impact on this process remains unexplored. We demonstrate that the efficiency of the conversion can be significantly enhanced by a proper regularization of the activation sparsity of the base model. Moreover, motivated by the high variance of the number of activated neurons for different inputs, we introduce a more effective dynamic-$k$ expert selection rule that adjusts the number of executed experts on a per-token basis. To achieve further savings, we extend this approach to multi-head attention projections. Finally, we develop an efficient implementation that translates these computational savings into actual wall-clock speedup. The proposed method, Dense to Dynamic-$k$ Mixture-of-Experts (D2DMoE), outperforms existing approaches on common NLP and vision tasks, reducing inference cost by up to 60% without significantly impacting performance.
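
The two ideas in the abstract, an equivalent dense-to-MoE conversion and a per-token dynamic-$k$ selection rule, can be illustrated with a minimal pure-Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper's code: the function names (`ffn`, `split_into_experts`, `dynamic_k_select`, `moe_ffn`), the contiguous split of hidden neurons into experts, the threshold `tau`, and the rule "keep every expert whose score is at least `tau` times the largest score". The scores here are oracle values (the true output norm of each expert), which a real system would have to predict with a small router in order to save any computation.

```python
import random


def relu(v):
    return v if v > 0.0 else 0.0


def ffn(x, w1, w2):
    """ReLU feed-forward block; w1 and w2 each hold one row per hidden neuron."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    d_out = len(w2[0])
    return [sum(h * row[o] for h, row in zip(hidden, w2)) for o in range(d_out)]


def split_into_experts(w1, w2, n_experts):
    """Partition the hidden neurons into contiguous, equally sized groups.

    Assumption: contiguous groups keep the sketch short; the number of hidden
    neurons must be divisible by n_experts.
    """
    size = len(w1) // n_experts
    return [(w1[i * size:(i + 1) * size], w2[i * size:(i + 1) * size])
            for i in range(n_experts)]


def dynamic_k_select(scores, tau):
    """Assumed rule: keep every expert scoring at least tau * (best score).

    The number of selected experts therefore varies from token to token.
    """
    top = max(scores)
    return [i for i, s in enumerate(scores) if s >= tau * top]


def moe_ffn(x, experts, tau):
    """Sum the outputs of the selected experts for a single token x."""
    # Oracle scoring: every expert is evaluated just to obtain its output norm.
    # This saves nothing; it only demonstrates the selection rule.
    outs = [ffn(x, w1e, w2e) for w1e, w2e in experts]
    scores = [sum(v * v for v in out) ** 0.5 for out in outs]
    chosen = dynamic_k_select(scores, tau)
    y = [sum(outs[i][o] for i in chosen) for o in range(len(outs[0]))]
    return y, chosen


if __name__ == "__main__":
    rng = random.Random(0)
    d_in, d_hidden, d_out = 4, 8, 3
    w1 = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_hidden)]
    w2 = [[rng.uniform(-1, 1) for _ in range(d_out)] for _ in range(d_hidden)]
    experts = split_into_experts(w1, w2, n_experts=4)
    for _ in range(3):
        x = [rng.uniform(-1, 1) for _ in range(d_in)]
        _, chosen = moe_ffn(x, experts, tau=0.5)
        print(len(chosen), "of", len(experts), "experts executed")
```

With `tau=0` every expert is selected and the summed output matches the dense block up to floating-point rounding, which is the sense in which the converted layer is equivalent. Raising `tau` skips experts whose contribution is small relative to the strongest one, so tokens that activate few neurons execute fewer experts.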

Code Implementations: 2 repos