CL LG Mar 24, 2022

Token Dropping for Efficient BERT Pretraining

UW
arXiv:2203.13240v1 657 citations h-index: 41
Originality: Incremental advance
AI Analysis

This incremental improvement addresses the high computational cost of pretraining transformer models like BERT for researchers and practitioners.

The paper tackles the computational inefficiency of transformer models by introducing a token dropping method that reduces pretraining cost by 25% while maintaining similar performance on downstream tasks.

Transformer-based models generally allocate the same amount of computation for each token in a given sequence. We develop a simple but effective "token dropping" method to accelerate the pretraining of transformer models, such as BERT, without degrading its performance on downstream tasks. In short, we drop unimportant tokens starting from an intermediate layer in the model to make the model focus on important tokens; the dropped tokens are later picked up by the last layer of the model so that the model still produces full-length sequences. We leverage the already built-in masked language modeling (MLM) loss to identify unimportant tokens with practically no computational overhead. In our experiments, this simple approach reduces the pretraining cost of BERT by 25% while achieving similar overall fine-tuning performance on standard downstream tasks.
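The control flow described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a minimal NumPy toy for a single sequence. It assumes a per-position importance score is already available (the abstract says it is derived from the MLM loss but does not give the exact scoring rule), and the names `split_by_importance`, `forward_with_token_dropping`, `drop_start`, and `keep_ratio` are hypothetical. Layers are modeled as plain callables over a `(num_tokens, hidden_dim)` array.

```python
import numpy as np


def split_by_importance(importance, keep_ratio):
    """Return (kept, dropped) position indices, each sorted in original order."""
    seq_len = importance.shape[0]
    n_keep = max(1, int(round(seq_len * keep_ratio)))
    # Stable sort on negated scores: most important first, ties broken by position.
    order = np.argsort(-importance, kind="stable")
    kept = np.sort(order[:n_keep])
    dropped = np.sort(order[n_keep:])
    return kept, dropped


def forward_with_token_dropping(hidden, importance, layers, drop_start, keep_ratio=0.5):
    """Toy forward pass with token dropping (illustrative assumption, not the paper's code).

    hidden:     (seq_len, hidden_dim) array of token representations.
    importance: (seq_len,) array; higher means more important (assumed given).
    layers:     list of callables mapping (n, hidden_dim) -> (n, hidden_dim).
    drop_start: index of the first layer that sees only the kept tokens.
    """
    if not 1 <= drop_start < len(layers):
        raise ValueError("drop_start must leave at least one full layer at each end")

    # 1. Early layers process the full sequence.
    for layer in layers[:drop_start]:
        hidden = layer(hidden)

    # 2. Intermediate layers process only the important tokens.
    kept, dropped = split_by_importance(importance, keep_ratio)
    reduced = hidden[kept]
    for layer in layers[drop_start:-1]:
        reduced = layer(reduced)

    # 3. Dropped tokens rejoin with the representation they had when dropped,
    #    so the last layer still produces a full-length sequence.
    merged = hidden.copy()
    merged[kept] = reduced
    return layers[-1](merged), kept, dropped


# Usage: four toy "layers" that each add 1, so the number of layers a token
# passed through is visible in the output values.
toy_layers = [lambda x: x + 1.0] * 4
scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7])
out, kept, dropped = forward_with_token_dropping(
    np.zeros((6, 2)), scores, toy_layers, drop_start=1, keep_ratio=0.5
)
```

In this toy run the kept positions pass through all four layers while the dropped positions skip the two intermediate ones, which is where the compute saving comes from. A real model would apply the same gather and scatter per batch element and would also have to handle attention masks and position information, which this sketch ignores.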

