CLApr 26, 2021

PanGu-$α$: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation

Wei Zeng, Xiaozhe Ren, Teng Su, Hui Wang, Yi Liao, Zhiwei Wang, Xin Jiang, ZhenZhang Yang, Kaisheng Wang, Xiaoda Zhang, Chen Li, Ziyan Gong

arXiv:2104.12369v113.5241 citations

Originality Incremental advance

AI Analysis

This work provides a large-scale Chinese language model that addresses the need for high-performance NLP tools in Chinese domains, though it is incremental as it adapts existing paradigms like GPT-3 to Chinese data and infrastructure.

The authors tackled the problem of scaling up autoregressive language models for Chinese NLP by training PanGu-α with up to 200 billion parameters, achieving superior performance in few-shot and zero-shot settings across tasks like text summarization and question answering.

Large-scale Pretrained Language Models (PLMs) have become the new paradigm for Natural Language Processing (NLP). PLMs with hundreds of billions parameters such as GPT-3 have demonstrated strong performances on natural language understanding and generation with \textit{few-shot in-context} learning. In this work, we present our practice on training large-scale autoregressive language models named PanGu-$α$, with up to 200 billion parameters. PanGu-$α$ is developed under the MindSpore and trained on a cluster of 2048 Ascend 910 AI processors. The training parallelism strategy is implemented based on MindSpore Auto-parallel, which composes five parallelism dimensions to scale the training task to 2048 processors efficiently, including data parallelism, op-level model parallelism, pipeline model parallelism, optimizer model parallelism and rematerialization. To enhance the generalization ability of PanGu-$α$, we collect 1.1TB high-quality Chinese data from a wide range of domains to pretrain the model. We empirically test the generation ability of PanGu-$α$ in various scenarios including text summarization, question answering, dialogue generation, etc. Moreover, we investigate the effect of model scales on the few-shot performances across a broad range of Chinese NLP tasks. The experimental results demonstrate the superior capabilities of PanGu-$α$ in performing various tasks under few-shot or zero-shot settings.

View on arXiv PDF

Similar