LG · CL · ML · Oct 5, 2020

PMI-Masking: Principled masking of correlated spans

arXiv:2010.01825v1 · 85 citations
Originality: Highly original
AI Analysis

For NLP researchers and practitioners, this paper addresses a common flaw in MLM pretraining and offers a principled improvement over heuristic masking approaches.

The paper tackles the inefficiency of uniform random masking when pretraining Masked Language Models such as BERT, which leads to suboptimal downstream performance. It proposes PMI-Masking, a strategy that jointly masks correlated token n-grams identified by Pointwise Mutual Information (PMI). Experiments show that it reaches the performance of prior masking approaches in half the training time and improves results at the end of training.

Masking tokens uniformly at random constitutes a common flaw in the pretraining of Masked Language Models (MLMs) such as BERT. We show that such uniform masking allows an MLM to minimize its training objective by latching onto shallow local signals, leading to pretraining inefficiency and suboptimal downstream performance. To address this flaw, we propose PMI-Masking, a principled masking strategy based on the concept of Pointwise Mutual Information (PMI), which jointly masks a token n-gram if it exhibits high collocation over the corpus. PMI-Masking motivates, unifies, and improves upon prior more heuristic approaches that attempt to address the drawback of random uniform token masking, such as whole-word masking, entity/phrase masking, and random-span masking. Specifically, we show experimentally that PMI-Masking reaches the performance of prior masking approaches in half the training time, and consistently improves performance at the end of training.
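To make the idea concrete, here is a minimal Python sketch of PMI-based span masking, restricted to bigrams. It uses the standard definition PMI(a, b) = log( p(a, b) / (p(a) p(b)) ). The abstract does not say how the measure is extended to longer n-grams, how collocations are selected, or how the masking budget is spent, so everything beyond the PMI formula is an assumption made for illustration: the function names `bigram_pmi` and `pmi_mask`, the greedy left-to-right segmentation, the 15% budget, and replacing every chosen token with `[MASK]`. It is not the authors' implementation.

```python
import math
import random
from collections import Counter


def bigram_pmi(corpus):
    """Return {(a, b): PMI} for every adjacent token pair in a tokenized corpus.

    PMI(a, b) = log( p(a, b) / (p(a) * p(b)) ), with probabilities estimated
    from raw counts. Hypothetical helper, bigrams only.
    """
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    pmi = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi[(a, b)] = math.log(p_ab / (p_a * p_b))
    return pmi


def pmi_mask(tokens, collocations, mask_rate=0.15, mask_token="[MASK]", rng=None):
    """Mask whole units, where a unit is a collocated bigram or a single token.

    Assumptions (not from the paper): greedy left-to-right segmentation,
    a budget of about mask_rate * len(tokens) tokens, and plain replacement
    with mask_token.
    """
    rng = rng or random.Random()
    # Segment the sequence into (start, end) units.
    units, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in collocations:
            units.append((i, i + 2))
            i += 2
        else:
            units.append((i, i + 1))
            i += 1
    rng.shuffle(units)
    budget = max(1, round(mask_rate * len(tokens)))
    masked, n_masked = list(tokens), 0
    for start, end in units:
        if n_masked >= budget:
            break
        for j in range(start, end):
            masked[j] = mask_token  # a collocation is masked jointly
        n_masked += end - start
    return masked


corpus = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["the", "city", "is", "big"],
    ["i", "love", "the", "city"],
]
scores = bigram_pmi(corpus)
# Keep bigrams whose PMI exceeds an arbitrary, illustrative threshold.
collocations = {pair for pair, s in scores.items() if s > 2.0}
print(pmi_mask(["i", "love", "new", "york", "today"], collocations))
```

Because masking operates on units rather than individual tokens, a collocation such as "new york" is either fully hidden or fully visible, so the model cannot recover one half from the other. That is the shallow local signal the abstract says uniform masking leaves exposed.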

Code Implementations: 1 repo
