CL · Jun 24, 2024

LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training

arXiv:2406.16554v1 · 155 citations · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of efficiently scaling large language models, a concern for AI researchers and practitioners. It is an incremental advance because it adapts an existing model rather than introducing a new paradigm.

The paper addresses two problems in training large-scale Mixture-of-Experts (MoE) models from scratch: heavy data requirements and training instability. Rather than starting from scratch, it builds an MoE model from the existing LLaMA-2 7B model through expert construction followed by continual pre-training. After training on 200B tokens, the resulting LLaMA-MoE-3.5B models significantly outperform dense models with a similar number of activated parameters.

Mixture-of-Experts (MoE) has gained increasing popularity as a promising framework for scaling up large language models (LLMs). However, training MoE from scratch in a large-scale setting still suffers from data-hungry and instability problems. Motivated by this limit, we investigate building MoE models from existing dense large language models. Specifically, based on the well-known LLaMA-2 7B model, we obtain an MoE model by: (1) Expert Construction, which partitions the parameters of original Feed-Forward Networks (FFNs) into multiple experts; (2) Continual Pre-training, which further trains the transformed MoE model and additional gate networks. In this paper, we comprehensively explore different methods for expert construction and various data sampling strategies for continual pre-training. After these stages, our LLaMA-MoE models could maintain language abilities and route the input tokens to specific experts with part of the parameters activated. Empirically, by training 200B tokens, LLaMA-MoE-3.5B models significantly outperform dense models that contain similar activation parameters. The source codes and models are available at https://github.com/pjlab-sys4nlp/llama-moe .
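
To make the expert construction step in the abstract concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes a LLaMA-style SwiGLU FFN, a random non-overlapping split of the intermediate neurons (one of several possible partitioning methods), and a simple top-k router with softmax weights over the selected logits. The function names, the toy dimensions, and the gating details are illustrative assumptions and are not taken from the paper or its repository.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def dense_ffn(x, w_gate, w_up, w_down):
    """SwiGLU FFN: (silu(x @ w_gate) * (x @ w_up)) @ w_down."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def split_into_experts(w_gate, w_up, w_down, num_experts, rng):
    """Assumed method: randomly partition intermediate neurons into disjoint groups."""
    d_ff = w_gate.shape[1]
    groups = np.array_split(rng.permutation(d_ff), num_experts)
    # Each expert keeps the matching columns of w_gate/w_up and rows of w_down.
    return [(w_gate[:, idx], w_up[:, idx], w_down[idx, :]) for idx in groups]

def moe_ffn(x, experts, w_router, top_k):
    """Hypothetical top-k gate: softmax over the selected router logits."""
    logits = x @ w_router                                # (tokens, num_experts)
    top_idx = np.argsort(-logits, axis=-1)[:, :top_k]    # chosen experts per token
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    weights = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = np.zeros((x.shape[0], experts[0][2].shape[1]))
    for t in range(x.shape[0]):
        for slot in range(top_k):
            expert = experts[top_idx[t, slot]]
            out[t] += weights[t, slot] * dense_ffn(x[t:t + 1], *expert)[0]
    return out, top_idx

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
d_model, d_ff, num_experts, top_k = 8, 32, 4, 2
w_gate = rng.standard_normal((d_model, d_ff))
w_up = rng.standard_normal((d_model, d_ff))
w_down = rng.standard_normal((d_ff, d_model))
w_router = rng.standard_normal((d_model, num_experts))  # new gate network, trained later
x = rng.standard_normal((5, d_model))

experts = split_into_experts(w_gate, w_up, w_down, num_experts, rng)
dense_out = dense_ffn(x, w_gate, w_up, w_down)
all_experts_out = sum(dense_ffn(x, *e) for e in experts)
moe_out, chosen = moe_ffn(x, experts, w_router, top_k)
```

Because the split is disjoint and covers every intermediate neuron, summing all expert outputs without gating reproduces the dense FFN output exactly. With only `top_k` experts active per token, the output deviates from the dense model, which is the gap that continual pre-training of the experts and the new gate is meant to close.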

Code Implementations: 2 repos
