InforMask: Unsupervised Informative Masking for Language Model Pretraining
This work addresses inefficient pretraining for natural language understanding by proposing a more effective masking strategy. The contribution is incremental, since it builds on existing masked language modeling approaches.
The paper tackles the suboptimality of random masking in language model pretraining by proposing InforMask, an unsupervised masking strategy that uses Pointwise Mutual Information (PMI) to select informative tokens to mask. InforMask outperforms random masking and previously proposed masking strategies on the factual recall benchmark LAMA and the question answering benchmarks SQuAD v1 and v2.
Masked language modeling is widely used for pretraining large language models for natural language understanding (NLU). However, random masking is suboptimal, allocating an equal masking rate for all tokens. In this paper, we propose InforMask, a new unsupervised masking strategy for training masked language models. InforMask exploits Pointwise Mutual Information (PMI) to select the most informative tokens to mask. We further propose two optimizations for InforMask to improve its efficiency. With a one-off preprocessing step, InforMask outperforms random masking and previously proposed masking strategies on the factual recall benchmark LAMA and the question answering benchmark SQuAD v1 and v2.
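To make the idea of PMI-guided masking concrete, the following is a minimal sketch and not the paper's implementation. It uses the standard definition PMI(a, b) = log(p(a, b) / (p(a) p(b))). The abstract does not say how PMI values are turned into a per-token score, so the sketch assumes sentence-level co-occurrence counts and scores each token by the sum of its PMI with the other tokens in the sentence. The function names `pmi_table`, `informative_scores`, and `mask_top_k`, the 15% masking rate, and the `[MASK]` string are illustrative assumptions. The two efficiency optimizations mentioned above are not modeled.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(corpus):
    """Estimate PMI for token pairs that co-occur in the same sentence.

    Assumption: probabilities are sentence-level frequencies, i.e.
    p(a) = (#sentences containing a) / (#sentences).
    """
    token_counts = Counter()
    pair_counts = Counter()
    for sentence in corpus:
        unique = sorted(set(sentence))  # sorted so each pair key is (smaller, larger)
        token_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    n = len(corpus)
    pmi = {}
    for (a, b), c in pair_counts.items():
        # PMI(a, b) = log( p(a, b) / (p(a) * p(b)) )
        pmi[(a, b)] = math.log(c * n / (token_counts[a] * token_counts[b]))
    return pmi


def informative_scores(sentence, pmi):
    """Score each position by the summed PMI between its token and the rest.

    Assumption: summation is one plausible aggregate, chosen for illustration.
    Pairs never seen in the corpus contribute 0.
    """
    scores = []
    for i, tok in enumerate(sentence):
        total = 0.0
        for j, other in enumerate(sentence):
            if i != j and tok != other:
                key = (tok, other) if tok < other else (other, tok)
                total += pmi.get(key, 0.0)
        scores.append(total)
    return scores


def mask_top_k(sentence, pmi, mask_rate=0.15, mask_token="[MASK]"):
    """Mask the highest-scoring positions instead of sampling them uniformly."""
    k = max(1, round(len(sentence) * mask_rate))
    scores = informative_scores(sentence, pmi)
    ranked = sorted(range(len(sentence)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:k])
    return [mask_token if i in chosen else tok for i, tok in enumerate(sentence)]


# Toy usage: "the" occurs in every sentence, so its PMI with any token is 0
# and it is never preferred over tokens that carry more information.
toy_corpus = [
    ["the", "paris", "france"],
    ["the", "paris", "france"],
    ["the", "berlin", "germany"],
    ["the", "cat", "mat"],
]
toy_pmi = pmi_table(toy_corpus)
print(mask_top_k(["the", "paris", "france"], toy_pmi))
```

Under these assumptions, a token that appears in every sentence gets a score of zero, while tokens that co-occur with specific partners more often than chance score higher. That is the intuition behind preferring informative tokens over a uniform masking rate. Since the statistics are computed once over the corpus, the scoring fits the one-off preprocessing step described above.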