Mar 12, 2024

Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM

Meta AI, MIT
arXiv:2403.07816v1 · 106 citations · h-index: 34
Originality: Incremental advance
AI Analysis

The paper addresses the need for scalable multi-domain LLM training and offers a method that reduces communication cost and improves throughput. It appears incremental, however, because it builds on the existing Branch-Train-Merge and sparse upcycling techniques.

The paper tackles the problem of efficiently training large language models with capabilities in multiple specialized domains. It proposes Branch-Train-MiX (BTX), which branches a seed model to train experts in parallel, combines them into a Mixture-of-Experts (MoE) model, and finetunes the result to learn token-level routing. The authors report that BTX achieves the best accuracy-efficiency tradeoff among the approaches compared.

We investigate efficient methods for training Large Language Models (LLMs) to possess capabilities in multiple specialized domains, such as coding, math reasoning and world knowledge. Our method, named Branch-Train-MiX (BTX), starts from a seed model, which is branched to train experts in embarrassingly parallel fashion with high throughput and reduced communication cost. After individual experts are asynchronously trained, BTX brings together their feedforward parameters as experts in Mixture-of-Expert (MoE) layers and averages the remaining parameters, followed by an MoE-finetuning stage to learn token-level routing. BTX generalizes two special cases, the Branch-Train-Merge method, which does not have the MoE finetuning stage to learn routing, and sparse upcycling, which omits the stage of training experts asynchronously. Compared to alternative approaches, BTX achieves the best accuracy-efficiency tradeoff.
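
The combination step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: parameters are modeled as flat lists of floats, the names `merge_into_moe` and `top_k_route` and the `"ffn"` naming convention are hypothetical, and the top-k softmax gate with `k=2` is an assumed routing scheme that the abstract does not specify.

```python
import math
from statistics import fmean


def merge_into_moe(expert_states, ffn_marker="ffn"):
    """Combine per-expert parameter dicts into one MoE-style dict.

    Parameters whose name contains `ffn_marker` (an assumed naming
    convention) are kept as one copy per expert; all other parameters
    are averaged element-wise across experts.
    """
    merged = {}
    for name in expert_states[0]:
        values = [state[name] for state in expert_states]
        if ffn_marker in name:
            # Feedforward weights become separate experts in the MoE layer.
            merged[name] = [list(v) for v in values]
        else:
            # Remaining weights (e.g. attention) are averaged.
            merged[name] = [fmean(column) for column in zip(*values)]
    return merged


def top_k_route(router_logits, k=2):
    """Return {expert_index: gate_weight} for one token.

    Assumed scheme: pick the k largest router logits, then apply a
    softmax over those k values only.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return {i: e / total for i, e in zip(top, exps)}


if __name__ == "__main__":
    # Two toy "experts" branched from the same seed model.
    experts = [
        {"attn.w": [1.0, 2.0], "ffn.w": [10.0, 20.0]},
        {"attn.w": [3.0, 4.0], "ffn.w": [30.0, 40.0]},
    ]
    print(merge_into_moe(experts))
    print(top_k_route([0.1, 2.0, 1.5, -1.0], k=2))
```

In this sketch the router logits are supplied directly; in an actual MoE layer they would come from a learned router, which is what the MoE-finetuning stage trains.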

Code Implementations: 1 repo