LG · CL · Jan 8, 2024

Mixtral of Experts

arXiv:2401.04088v1 · 1929 citations · h-index: 27
Originality: Highly original
AI Analysis

This work addresses the high computational cost of large language models, offering AI researchers and practitioners a more efficient alternative with competitive performance.

The paper tackles the challenge of scaling language models efficiently by introducing Mixtral 8x7B, a Sparse Mixture of Experts model in which each token uses only 13B active parameters during inference while having access to 47B parameters in total. It outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks, with particular gains over Llama 2 70B in mathematics, code generation, and multilingual tasks.

We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
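The routing step described in the abstract can be illustrated with a small sketch. This is not the released implementation: it uses plain Python lists instead of tensors, the experts are hypothetical toy functions rather than real feedforward blocks, and the choice to combine the two outputs with a softmax over the two selected router logits is an assumption made for this example. The names `top2_route`, `moe_layer`, and `make_toy_expert` are illustrative only.

```python
import math
import random


def top2_route(router_logits):
    """Return [(expert_index, weight), (expert_index, weight)] for the two largest logits.

    Weights are a softmax over only the two selected logits (an assumption
    of this sketch), so they always sum to 1.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:2]
    peak = max(router_logits[i] for i in chosen)  # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def moe_layer(x, router_weights, experts):
    """Sparse mixture-of-experts forward pass for a single token vector x."""
    logits = matvec(router_weights, x)  # one router logit per expert
    routes = top2_route(logits)
    out = [0.0] * len(x)
    for idx, weight in routes:
        y = experts[idx](x)  # only the two selected experts are evaluated
        out = [o + weight * v for o, v in zip(out, y)]
    return out, routes


def make_toy_expert(scale):
    # Hypothetical stand-in for a feedforward block: it just scales the input.
    return lambda x: [scale * v for v in x]


NUM_EXPERTS, DIM = 8, 4  # 8 experts as in the abstract; DIM is a toy value
rng = random.Random(0)
router_weights = [[rng.uniform(-1, 1) for _ in range(DIM)]
                  for _ in range(NUM_EXPERTS)]
experts = [make_toy_expert(float(i + 1)) for i in range(NUM_EXPERTS)]

token = [0.5, -1.0, 0.25, 2.0]
output, routes = moe_layer(token, router_weights, experts)
print("selected experts and weights:", routes)
print("layer output:", output)
```

Because only two of the eight experts run for any given token, the compute per token scales with the selected experts rather than with the full expert set, which is the intuition behind the gap between active and total parameters.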

Code Implementations: 6 repos
