LG · Nov 28, 2024

On the Role of Discrete Representation in Sparse Mixture of Experts

arXiv:2411.19402v2 · 3 citations · h-index: 3 · Trans. Mach. Learn. Res.
Originality: Highly original
AI Analysis

This paper addresses a critical weakness in scaling large models efficiently, which matters to ML/AI practitioners, though it appears incremental because it builds on existing SMoE frameworks.

The paper tackles routing inconsistencies and representation collapse in sparse mixture of experts (SMoE) by proposing Vector-Quantized Mixture of Experts (VQMoE), which uses discrete representations via vector quantization to assign experts indirectly, achieving a 28% improvement in robustness compared to other SMoE routing methods.

Sparse mixture of experts (SMoE) is an effective solution for scaling up model capacity without increasing computational cost. A crucial component of SMoE is the router, which directs each input to the relevant experts; however, the router is also a major weakness, leading to routing inconsistencies and representation collapse. Instead of fixing the router as previous works do, we propose an alternative that assigns experts to inputs via indirection: a discrete representation of the input points to the expert. The discrete representations are learnt via vector quantization, resulting in a new architecture dubbed Vector-Quantized Mixture of Experts (VQMoE). We provide theoretical support and empirical evidence demonstrating VQMoE's ability to overcome the challenges present in traditional routers. Through extensive evaluations on both large language models and vision tasks, for pre-training and fine-tuning, we show that VQMoE achieves a 28% improvement in robustness compared to other SMoE routing methods, while maintaining strong performance in fine-tuning tasks.
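
The indirection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a fixed (not learnt) codebook, one expert per code index, and toy experts that only scale their input. The names `nearest_code`, `make_expert`, and `vq_route` are hypothetical and chosen for illustration only.

```python
def nearest_code(x, codebook):
    """Return the index of the codebook vector closest to x (squared L2 distance)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(x, codebook[k]))


def make_expert(scale):
    """Hypothetical stand-in for an expert network: multiplies its input by `scale`."""
    return lambda x: [scale * xi for xi in x]


def vq_route(x, codebook, experts):
    """Quantize x to a discrete code, then use that code to pick the expert.

    Assumption: code k maps directly to expert k. No learned router is involved.
    """
    k = nearest_code(x, codebook)
    return k, experts[k](x)


# Toy setup: three codes in 2-D, one expert per code (assumed mapping).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
experts = [make_expert(s) for s in (1.0, 2.0, 3.0)]

k, y = vq_route([0.9, 1.2], codebook, experts)
print(k, y)
```

Here the input `[0.9, 1.2]` is closest to the second codebook entry, so it is handled by the second expert. In a trained model the codebook would be learnt by vector quantization rather than fixed by hand, and the expert selection follows from the nearest code instead of from a separate router's scores.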

