LG · OC · ML · Jun 7, 2021

DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning

arXiv:2106.03760v3 · 198 citations · Has Code
Originality: Incremental advance
AI Analysis

This work addresses convergence and statistical performance issues in multi-task learning, which matters to researchers and practitioners who train with gradient-based methods. The advance is incremental because it builds on existing MoE architectures.

The paper tackles non-smooth sparse gates in Mixture-of-Experts models, which cause convergence and statistical performance issues. It introduces DSelect-k, a continuously differentiable sparse gate that improves prediction and expert selection and achieves over 22% improvement in predictive performance compared to Top-k on a real-world recommender system.

The Mixture-of-Experts (MoE) architecture is showing promising results in improving parameter sharing in multi-task learning (MTL) and in scaling high-capacity neural networks. State-of-the-art MoE models use a trainable sparse gate to select a subset of the experts for each input example. While conceptually appealing, existing sparse gates, such as Top-k, are not smooth. The lack of smoothness can lead to convergence and statistical performance issues when training with gradient-based methods. In this paper, we develop DSelect-k: a continuously differentiable and sparse gate for MoE, based on a novel binary encoding formulation. The gate can be trained using first-order methods, such as stochastic gradient descent, and offers explicit control over the number of experts to select. We demonstrate the effectiveness of DSelect-k on both synthetic and real MTL datasets with up to $128$ tasks. Our experiments indicate that DSelect-k can achieve statistically significant improvements in prediction and expert selection over popular MoE gates. Notably, on a real-world, large-scale recommender system, DSelect-k achieves over $22\%$ improvement in predictive performance compared to Top-k. We provide an open-source implementation of DSelect-k.
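To make the contrast in the abstract concrete, here is a minimal pure-Python sketch; it is not the authors' implementation. `top_k_gate` is one common form of Top-k gating (softmax over the k largest logits). `binary_selector` illustrates how a binary encoding can pick one of 2^m experts from only m real parameters, using a cubic smooth-step in place of a hard 0/1 threshold. The function names, the `gamma` width parameter, and the specific smooth-step polynomial are assumptions made for illustration; the abstract only states that the gate is continuously differentiable, sparse, and based on a binary encoding formulation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(logits, k):
    """Top-k gate: softmax over the k largest logits, zero elsewhere.

    The output jumps when the ranking of two logits swaps, so the gate
    is not a smooth function of the logits.
    """
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in keep])
    gate = [0.0] * len(logits)
    for i, w in zip(keep, weights):
        gate[i] = w
    return gate


def smooth_step(t, gamma=1.0):
    """Cubic smooth-step (assumed form): exactly 0 for t <= -gamma/2,
    exactly 1 for t >= gamma/2, and continuously differentiable between."""
    if t <= -gamma / 2:
        return 0.0
    if t >= gamma / 2:
        return 1.0
    return -2.0 / gamma**3 * t**3 + 3.0 / (2.0 * gamma) * t + 0.5


def binary_selector(z, gamma=1.0):
    """Weights over 2**len(z) experts from len(z) real parameters.

    Expert i gets the product, over bit positions j of its index, of
    s_j if bit j is set and (1 - s_j) otherwise, where s_j = smooth_step(z_j).
    The weights always sum to 1 and become one-hot once every s_j is 0 or 1.
    """
    s = [smooth_step(zj, gamma) for zj in z]
    weights = []
    for i in range(2 ** len(z)):
        w = 1.0
        for j, sj in enumerate(s):
            w *= sj if (i >> j) & 1 else 1.0 - sj
        weights.append(w)
    return weights


# A tiny change in the logits flips the Top-k output completely.
gate_a = top_k_gate([1.0, 0.999, 0.0], k=1)
gate_b = top_k_gate([0.999, 1.0, 0.0], k=1)

# Saturated parameters give an exactly sparse (one-hot) selection.
one_hot = binary_selector([1.0, -1.0])
# Unsaturated parameters give a dense but still normalized weighting.
soft = binary_selector([0.1, -0.2])
```

In this sketch the selector varies smoothly with `z` yet still reaches exact zeros, which is the property the abstract contrasts with Top-k. One plausible way to select up to k experts, again an assumption rather than something stated in the abstract, is to combine k such selectors with learned mixing weights.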

Code Implementations: 3 repos
