CL, LG · Aug 13, 2021

Towards Structured Dynamic Sparse Pre-Training of BERT

Anastasia Dietrich, Frithjof Gressmann, Douglas Orr, Ivan Chelombiev, Daniel Justus, Carlo Luschi

arXiv:2108.06277v12.417 citations

Originality: Incremental advance

AI Analysis

This work addresses the need for more efficient training methods for large language models; its contribution is an incremental advance in machine learning optimization.

The paper tackles the problem of computationally efficient unsupervised training of large language models by developing a dynamic always-sparse pre-training approach for BERT, achieving Pareto improvements in FLOPs over statically sparse and dense models across a range of network sizes.

Identifying algorithms for computationally efficient unsupervised training of large language models is an important and active area of research. In this work, we develop and study a straightforward, dynamic always-sparse pre-training approach for the BERT language modeling task, which leverages periodic compression steps based on magnitude pruning followed by random parameter re-allocation. This approach enables us to achieve Pareto improvements in terms of the number of floating-point operations (FLOPs) over statically sparse and dense models across a broad spectrum of network sizes. Furthermore, we demonstrate that training remains FLOP-efficient when using coarse-grained block sparsity, making it particularly promising for efficient execution on modern hardware accelerators.
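
To make the "magnitude pruning followed by random parameter re-allocation" step concrete, here is a minimal NumPy sketch of one such compression step on a single weight matrix. It is an illustration under stated assumptions, not the authors' implementation: the function name `prune_and_regrow`, the `prune_fraction` parameter, the choice to regrow only at positions that were inactive before the step, and the zero initialization of regrown weights are all hypothetical. It operates on individual weights; the block-sparse variant mentioned in the abstract is not shown.

```python
import numpy as np


def prune_and_regrow(weights, mask, prune_fraction, rng):
    """One illustrative compression step that keeps the number of active weights fixed.

    Assumptions (not taken from the paper): regrowth only targets positions that
    were inactive before this step, and regrown weights start at zero.
    """
    flat_w = weights.ravel().copy()
    flat_m = mask.ravel().astype(bool).copy()

    active = np.flatnonzero(flat_m)
    inactive = np.flatnonzero(~flat_m)

    # Swap at most as many connections as there are free positions to regrow into.
    n_swap = min(int(prune_fraction * active.size), inactive.size)
    if n_swap == 0:
        return flat_w.reshape(weights.shape), flat_m.reshape(mask.shape)

    # Magnitude pruning: deactivate the active weights with the smallest |w|.
    order = np.argsort(np.abs(flat_w[active]))
    pruned = active[order[:n_swap]]
    flat_m[pruned] = False
    flat_w[pruned] = 0.0

    # Random re-allocation: activate the same number of previously inactive positions.
    regrown = rng.choice(inactive, size=n_swap, replace=False)
    flat_m[regrown] = True
    flat_w[regrown] = 0.0

    return flat_w.reshape(weights.shape), flat_m.reshape(mask.shape)
```

In a training loop, a step like this would be called periodically between ordinary sparse gradient updates, so the overall sparsity stays constant while the sparsity pattern changes over time.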
