CL LG ML Jun 30, 2020

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen

arXiv:2006.16668v1 31.92196 citations

Originality: Highly original

AI Analysis

This work addresses the challenge of efficiently scaling giant models for real-world applications such as multilingual translation, and it represents a significant advance rather than an incremental improvement.

The authors scaled a neural network beyond 600 billion parameters by developing GShard, a module that combines conditional computation with automatic sharding. The resulting multilingual translation model was trained in 4 days on 2048 TPU v3 accelerators and achieved far superior quality for translation from 100 languages to English compared to the prior art.

Neural network scaling has been critical for improving model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path, such as computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.
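
To make the conditional-computation idea in the abstract concrete, the following is a minimal pure-Python sketch of a sparsely gated Mixture-of-Experts layer: a gate scores every expert for a token, but only the top-scoring few are actually evaluated. It is an illustration under stated assumptions, not GShard's implementation. The function names (`softmax`, `top_k_route`, `moe_layer`), the choice of k = 2, and the toy scaling experts are all hypothetical, and the sketch omits everything the abstract attributes to GShard itself (sharding annotations, the XLA compiler extension, distribution across devices).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits, k=2):
    """Pick the k highest-scoring experts (k=2 is an assumption here).

    Returns [(expert_index, weight), ...] with weights renormalized
    so that they sum to 1 over the selected experts.
    """
    probs = softmax(gate_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_layer(token, gate_weights, experts, k=2):
    """Apply a sparsely gated mixture of experts to a single token.

    token:        list of floats (the token representation)
    gate_weights: one row of floats per expert (hypothetical linear gate)
    experts:      list of callables, each mapping a token to a same-size list
    Only the k selected experts are evaluated; the rest are skipped,
    which is what keeps per-token compute roughly constant as the
    number of experts grows.
    """
    gate_logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    routes = top_k_route(gate_logits, k)
    out = [0.0] * len(token)
    for idx, weight in routes:
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out, [idx for idx, _ in routes]


# Toy usage: four experts that each scale the token by a different factor.
toy_experts = [lambda t, s=s: [s * x for x in t] for s in (1.0, 2.0, 3.0, 4.0)]
toy_gate = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [-1.0, 0.0]]
output, used_experts = moe_layer([1.0, 0.0], toy_gate, toy_experts)
```

In this toy setup only two of the four experts run for the token, so adding more experts increases parameter count without increasing the work done per token. That is the property that lets a model grow to a very large size while training cost stays manageable.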
