CL, LG · Apr 14, 2021

Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little

Koustuv Sinha, Robin Jia, Dieuwke Hupkes, Joelle Pineau, Adina Williams, Douwe Kiela

arXiv:2104.06644v2 · 32.8724 citations

Originality: Incremental advance

AI Analysis

The result challenges the assumption that MLMs rely on syntax and suggests that their success is largely distributional, which could change how NLP models are evaluated and designed.

The paper investigates whether masked language models (MLMs) succeed because they learn syntactic structure or because they model higher-order word co-occurrence statistics. The authors pre-train MLMs on sentences with shuffled word order and show that the models still achieve high accuracy on downstream tasks, including tasks designed to challenge models that ignore word order.

A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in classical NLP pipelines. In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics. To demonstrate this, we pre-train MLMs on sentences with randomly shuffled word order, and show that these models still achieve high accuracy after fine-tuning on many downstream tasks -- including on tasks specifically designed to be challenging for models that ignore word order. Our models perform surprisingly well according to some parametric syntactic probes, indicating possible deficiencies in how we test representations for syntactic information. Overall, our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
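The shuffling idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pre-processing code: it assumes whitespace tokenization and a uniform random permutation within each sentence, and the names `shuffle_words` and `shuffle_corpus` are hypothetical. The point is only that shuffling destroys word order while leaving each sentence's bag of words, and therefore its co-occurrence information, intact.

```python
import random
from collections import Counter


def shuffle_words(sentence: str, rng: random.Random) -> str:
    """Return the sentence with its whitespace-separated tokens in random order.

    Assumption: tokens are split on whitespace; a real pipeline would
    likely use the model's own tokenizer.
    """
    tokens = sentence.split()
    rng.shuffle(tokens)  # in-place uniform permutation
    return " ".join(tokens)


def shuffle_corpus(sentences, seed=0):
    """Shuffle every sentence independently, using one seeded generator."""
    rng = random.Random(seed)
    return [shuffle_words(s, rng) for s in sentences]


corpus = [
    "the cat sat on the mat",
    "masked language models predict missing words",
]
shuffled = shuffle_corpus(corpus, seed=13)

for original, permuted in zip(corpus, shuffled):
    # Word order may change, but the multiset of words per sentence does not.
    assert Counter(original.split()) == Counter(permuted.split())
    print(permuted)
```

Because words never cross sentence boundaries, a model trained on the shuffled text still sees which words occur together, but no longer sees the order in which they occur.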
