CL · Oct 11, 2021

On a Benefit of Mask Language Modeling: Robustness to Simplicity Bias

arXiv:2110.05301v1 · 6 citations
Originality: Incremental advance
AI Analysis

This paper addresses a fundamental question in NLP that matters to both researchers and practitioners, though its advance is incremental because it builds on existing MLM theories.

The paper tackles the problem of understanding why masked language modeling (MLM) pretraining is beneficial by showing it makes models robust to lexicon-level spurious features, both theoretically and empirically, with experiments on hate speech detection and named entity recognition tasks.

Despite the success of pretrained masked language models (MLM), why MLM pretraining is useful is still a question not fully answered. In this work we theoretically and empirically show that MLM pretraining makes models robust to lexicon-level spurious features, partly answering the question. We theoretically show that, when we can model the distribution of a spurious feature $\Pi$ conditioned on the context, then (1) $\Pi$ is at least as informative as the spurious feature, and (2) learning from $\Pi$ is at least as simple as learning from the spurious feature. Therefore, MLM pretraining rescues the model from the simplicity bias caused by the spurious feature. We also explore the efficacy of MLM pretraining in causal settings. Finally, we close the gap between our theories and real-world practices by conducting experiments on the hate speech detection and named entity recognition tasks.

