CL, LG, ML
Mar 24, 2022

Multi-armed bandits for resource efficient, online optimization of language model pre-training: the use case of dynamic masking

arXiv:2203.13151v2224 citations
h-index: 21
Originality: Incremental advance
AI Analysis

This work addresses the high computational cost that pre-training large language models imposes on researchers and practitioners. It is an incremental advance, building on existing Bayesian optimization and bandit methods.

The paper tackles the problem of computationally expensive hyperparameter selection in Transformer language model pre-training by proposing a multi-armed bandit framework with Thompson sampling, which achieves lower Masked Language Model loss in fewer epochs and competitive downstream performance while saving computational resources.

We design and evaluate a Bayesian optimization framework for resource efficient pre-training of Transformer-based language models (TLMs). TLM pre-training requires high computational resources and introduces many unresolved design choices, such as selecting its pre-training hyperparameters. We propose a multi-armed bandit framework for the sequential selection of TLM pre-training hyperparameters, aimed at optimizing language model performance, in a resource efficient manner. We design a Thompson sampling algorithm, with a surrogate Gaussian process reward model of the Masked Language Model (MLM) pre-training objective, for its sequential minimization. Instead of MLM pre-training with fixed masking probabilities, the proposed Gaussian process-based Thompson sampling (GP-TS) accelerates pre-training by sequentially selecting masking hyperparameters that improve performance. We empirically demonstrate how GP-TS pre-trains language models efficiently, i.e., it achieves lower MLM loss in fewer epochs, across a variety of settings. In addition, GP-TS pre-trained TLMs attain competitive downstream performance, while avoiding expensive hyperparameter grid search. GP-TS provides an interactive framework for efficient and optimized TLM pre-training that, by circumventing costly hyperparameter selection, enables substantial computational savings.
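The selection loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a discrete grid of candidate masking probabilities as the bandit arms, a zero-mean Gaussian process with an RBF kernel as the surrogate reward model, and a synthetic `toy_reward` function standing in for the reward that a real pre-training interaction would produce. All function names, kernel settings, and the reward shape are hypothetical choices made for illustration.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.1, variance=1.0):
    """Squared-exponential kernel between two 1-D arrays (assumed kernel choice)."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)


def gp_posterior(x_obs, y_obs, x_grid, noise=1e-2):
    """Posterior mean and covariance of a zero-mean GP evaluated on x_grid."""
    if len(x_obs) == 0:
        # No data yet: the posterior equals the prior.
        return np.zeros(len(x_grid)), rbf_kernel(x_grid, x_grid)
    K = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    K_s = rbf_kernel(x_obs, x_grid)
    solved = np.linalg.solve(K, K_s)  # K^{-1} K_s
    mean = solved.T @ y_obs
    cov = rbf_kernel(x_grid, x_grid) - K_s.T @ solved
    return mean, cov


def toy_reward(p, rng):
    """Hypothetical stand-in for the observed reward (e.g. an MLM loss
    improvement) after pre-training for one interaction with masking
    probability p. The peak at p = 0.2 is arbitrary."""
    return -10.0 * (p - 0.2) ** 2 + 0.05 * rng.standard_normal()


def gp_thompson_sampling(arms, n_rounds, rng):
    """Thompson sampling over a grid of arms with a GP reward model."""
    xs, ys = [], []
    for _ in range(n_rounds):
        mean, cov = gp_posterior(np.array(xs), np.array(ys), arms)
        # Symmetrize and add jitter so the covariance is numerically valid.
        cov = 0.5 * (cov + cov.T) + 1e-6 * np.eye(len(arms))
        # Draw one plausible reward function and play the arm that maximizes it.
        sample = rng.multivariate_normal(mean, cov)
        arm = arms[np.argmax(sample)]
        xs.append(arm)
        ys.append(toy_reward(arm, rng))
    return np.array(xs), np.array(ys)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    arms = np.linspace(0.05, 0.45, 9)  # candidate masking probabilities
    chosen, rewards = gp_thompson_sampling(arms, n_rounds=20, rng=rng)
    print("masking probabilities played:", np.round(chosen, 2))
```

In an actual pre-training run, `toy_reward` would be replaced by training the model for one interaction with the chosen masking probability and computing a reward from the observed MLM loss. How that reward is defined is left open here.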

Code Implementations: 1 repo