CL · Aug 8, 2024

LaDiMo: Layer-wise Distillation Inspired MoEfier

arXiv:2408.04278v1 · 3 citations · h-index: 3
Originality: Incremental advance
AI Analysis

LaDiMo offers a flexible and efficient way to build and deploy MoE models, which addresses resource and environmental concerns in NLP. The advance is incremental, since it builds on existing knowledge distillation and MoE techniques.

The paper tackles the high training cost of large language models by proposing LaDiMo, an algorithm that efficiently converts a Transformer-based non-MoE model into a MoE model. Using only 100K tokens, the conversion of LLaMA2-7B reduces activated parameters by over 20% while maintaining accuracy.

The advent of large language models has revolutionized natural language processing, but their increasing complexity has led to substantial training costs, resource demands, and environmental impacts. In response, sparse Mixture-of-Experts (MoE) models have emerged as a promising alternative to dense models. Since training MoE models from scratch can be prohibitively expensive, recent studies have explored leveraging knowledge from pre-trained non-MoE models. However, existing approaches have limitations, such as requiring significant hardware resources and data. We propose a novel algorithm, LaDiMo, which efficiently converts a Transformer-based non-MoE model into a MoE model with minimal additional training cost. LaDiMo consists of two stages: layer-wise expert construction and routing policy decision. By harnessing the concept of Knowledge Distillation, we compress the model and rapidly recover its performance. Furthermore, we develop an adaptive router that optimizes inference efficiency by profiling the distribution of routing weights and determining a layer-wise policy that balances accuracy and latency. We demonstrate the effectiveness of our method by converting the LLaMA2-7B model to a MoE model using only 100K tokens, reducing activated parameters by over 20% while keeping accuracy. Our approach offers a flexible and efficient solution for building and deploying MoE models.
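The abstract describes the adaptive router only at a high level: it profiles the distribution of routing weights and picks a layer-wise policy. As a minimal sketch of what such a policy decision could look like, the Python below chooses, for each layer, the smallest number of experts whose average top-k routing weight reaches a threshold. The rule itself, the function name `choose_top_k`, the `mass_threshold` parameter, and the profiled weights are all assumptions made for illustration; they are not taken from the paper.

```python
from typing import Dict, List


def choose_top_k(routing_weights: List[List[float]], mass_threshold: float = 0.8) -> int:
    """Return the smallest k whose average top-k routing mass reaches the threshold.

    routing_weights holds one row per profiled token; each row is that token's
    routing distribution over the experts of a single layer (it sums to 1).
    This selection rule is a hypothetical stand-in, not the paper's method.
    """
    num_experts = len(routing_weights[0])
    for k in range(1, num_experts + 1):
        # Sum of the k largest routing weights, averaged over all profiled tokens.
        avg_mass = sum(
            sum(sorted(row, reverse=True)[:k]) for row in routing_weights
        ) / len(routing_weights)
        if avg_mass >= mass_threshold:
            return k
    return num_experts


# Hypothetical profiled routing weights for two layers with four experts each.
profiled: Dict[str, List[List[float]]] = {
    # Routing is sharply peaked: one expert carries almost all the weight.
    "layer_0": [[0.97, 0.01, 0.01, 0.01], [0.95, 0.03, 0.01, 0.01]],
    # Routing is flat: several experts are needed to cover most of the weight.
    "layer_1": [[0.40, 0.30, 0.20, 0.10], [0.35, 0.35, 0.20, 0.10]],
}

# Layer-wise policy: how many experts to activate in each layer.
policy = {name: choose_top_k(weights) for name, weights in profiled.items()}
print(policy)
```

Under this rule, a layer with peaked routing activates fewer experts, which lowers latency, while a layer with flat routing activates more, which protects accuracy. That is the kind of accuracy and latency balance the abstract attributes to the layer-wise policy.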
