CL · AI · May 24, 2023

Dynamic Masking Rate Schedules for MLM Pretraining

arXiv:2305.15096v3 · 109 citations
Originality: Incremental advance
AI Analysis

This work targets the pretraining efficiency and downstream accuracy of masked language models, offering NLP practitioners an incremental improvement over fixed-masking-rate baselines.

The paper tackles the problem of improving transformer pretraining with Masked Language Modeling by proposing a dynamic masking rate schedule instead of the fixed 15% rate. The result is an average GLUE accuracy improvement of up to 0.46% for BERT-base and 0.25% for BERT-large, along with up to a 1.89x speedup in pretraining.

Most works on transformers trained with the Masked Language Modeling (MLM) objective use the original BERT model's fixed masking rate of 15%. We propose to instead dynamically schedule the masking rate throughout training. We find that linearly decreasing the masking rate over the course of pretraining improves average GLUE accuracy by up to 0.46% and 0.25% in BERT-base and BERT-large, respectively, compared to fixed rate baselines. These gains come from exposure to both high and low masking rate regimes, providing benefits from both settings. Our results demonstrate that masking rate scheduling is a simple way to improve the quality of masked language models, achieving up to a 1.89x speedup in pretraining for BERT-base as well as a Pareto improvement for BERT-large.
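
The linearly decreasing schedule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the start and end rates (0.30 and 0.15), and the token-level masking helper are all hypothetical choices made for the example, and the helper omits BERT's 80/10/10 replacement rule for brevity.

```python
import random


def linear_masking_rate(step, total_steps, start_rate=0.30, end_rate=0.15):
    """Linearly interpolate the masking rate from start_rate to end_rate.

    start_rate and end_rate are illustrative defaults (assumptions),
    not values taken from the summary above.
    """
    if total_steps <= 1:
        return end_rate
    # Clamp the step so out-of-range values stay on the schedule's endpoints.
    clamped = min(max(step, 0), total_steps - 1)
    progress = clamped / (total_steps - 1)
    return start_rate + (end_rate - start_rate) * progress


def mask_tokens(tokens, rate, rng, mask_token="[MASK]"):
    """Replace a random subset of about rate * len(tokens) positions with mask_token.

    Simplified: every selected position becomes mask_token.
    """
    n_mask = round(rate * len(tokens))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [mask_token if i in positions else tok for i, tok in enumerate(tokens)]
    return masked, sorted(positions)


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = [f"tok{i}" for i in range(20)]
    total_steps = 5
    for step in range(total_steps):
        rate = linear_masking_rate(step, total_steps)
        masked, positions = mask_tokens(tokens, rate, rng)
        print(f"step={step} rate={rate:.4f} masked_positions={positions}")
```

Early steps mask more tokens and later steps mask fewer, so a single run passes through both the high and the low masking rate regimes. In a real pretraining loop the rate would typically be recomputed per batch and passed to the data collator.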

