CL AI LG | Nov 18, 2022

Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production

Young Jin Kim, Rawn Henry, Raffy Fahim, Hany Hassan Awadalla

Microsoft

arXiv:2211.10017v1

Originality: Highly original

AI Analysis

This work addresses the inefficient inference and high memory requirements that make large MoE models hard to deploy in cloud production. Solving them lets a single large model be deployed directly instead of being distilled into many smaller per-language or per-task models.

The paper introduces an efficient inference framework for large-scale Mixture of Experts (MoE) models that reduces memory usage and speeds up computation. It reports up to 26x higher throughput, a model size of almost one-eighth of the original 32-bit float model through 4-bit quantization of expert weights, and deployment of 136x larger models at 27% less cost and with better quality than existing solutions.

Mixture of Experts (MoE) models with conditional execution of sparsely activated layers have enabled training models with a much larger number of parameters. As a result, these models have achieved significantly better quality on various natural language processing tasks, including machine translation. However, it remains challenging to deploy such models in real-life scenarios because of their large memory requirements and inefficient inference. In this work, we introduce a highly efficient inference framework with several optimization approaches that accelerate the computation of sparse models and significantly cut memory consumption. While we achieve up to 26x speed-up in terms of throughput, we also reduce the model size to almost one eighth of the original 32-bit float model by quantizing expert weights into 4-bit integers. As a result, we are able to deploy 136x larger models with 27% less cost and significantly better quality compared to the existing solutions. This enables a paradigm shift in deploying large-scale multilingual MoE transformer models, replacing the traditional practice of distilling teacher models into dozens of smaller models per language or task.
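To make the "almost one eighth" figure concrete, here is a minimal sketch of 4-bit weight quantization with two values packed per byte. It is an illustration under stated assumptions, not the framework's implementation: the symmetric per-row scheme, the single 32-bit scale per row, and all function names (`quantize_row_int4`, `pack_int4`, `unpack_int4`, `dequantize_row`, `size_ratio`) are hypothetical choices made for this example.

```python
from typing import List, Tuple


def quantize_row_int4(row: List[float]) -> Tuple[List[int], float]:
    """Symmetric 4-bit quantization of one weight row (assumed scheme).

    Each weight maps to an integer in [-8, 7]; one float scale is kept per row.
    """
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in row]
    return q, scale


def pack_int4(q: List[int]) -> bytes:
    """Pack pairs of signed 4-bit integers into single bytes, low nibble first."""
    if len(q) % 2:
        q = q + [0]  # pad to an even count; the caller's list is not modified
    return bytes((q[i] & 0xF) | ((q[i + 1] & 0xF) << 4) for i in range(0, len(q), 2))


def unpack_int4(packed: bytes, count: int) -> List[int]:
    """Recover `count` signed 4-bit integers from packed bytes."""
    out: List[int] = []
    for b in packed:
        for nibble in (b & 0xF, b >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out[:count]


def dequantize_row(q: List[int], scale: float) -> List[float]:
    """Map quantized integers back to approximate float weights."""
    return [v * scale for v in q]


def size_ratio(n: int) -> float:
    """Packed size (4-bit weights plus one 4-byte scale) over fp32 size for n weights."""
    packed_bytes = n // 2 + n % 2 + 4
    return packed_bytes / (4 * n)


if __name__ == "__main__":
    row = [0.7, -0.35, 0.1, 0.0, -0.7]
    q, scale = quantize_row_int4(row)
    packed = pack_int4(q)
    print(unpack_int4(packed, len(q)) == q)  # round trip through the packed form
    print(size_ratio(1024))                  # slightly above 1/8
```

In this sketch the packed weights alone take exactly one eighth of the fp32 storage, and the per-row scale pushes the ratio slightly above that. Whether scale storage or other unquantized parameters account for the "almost" in the reported figure is not stated in the abstract.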
