CL | Dec 2, 2019

BLiMP: The Benchmark of Linguistic Minimal Pairs for English

Alex Warstadt, Alicia Parrish, Haokun Liu, Anhad Mohananey, Wei Peng, Sheng-Fu Wang, Samuel R. Bowman

arXiv:1912.00582v433.41134 citations | Has Code

Originality Synthesis-oriented

AI Analysis

BLiMP gives natural language processing researchers a tool for evaluating language models on specific grammatical phenomena. The contribution is incremental in that it builds on existing benchmark practices.

The paper introduced BLiMP, a benchmark of 67 sub-datasets with 1000 minimal pairs each to evaluate language models' knowledge of English grammar, finding that state-of-the-art models reliably handle morphological contrasts but struggle with semantic restrictions and subtle syntactic phenomena.

We introduce The Benchmark of Linguistic Minimal Pairs (shortened to BLiMP), a challenge set for evaluating what language models (LMs) know about major grammatical phenomena in English. BLiMP consists of 67 sub-datasets, each containing 1000 minimal pairs isolating specific contrasts in syntax, morphology, or semantics. The data is automatically generated according to expert-crafted grammars, and aggregate human agreement with the labels is 96.4%. We use it to evaluate n-gram, LSTM, and Transformer (GPT-2 and Transformer-XL) LMs. We find that state-of-the-art models identify morphological contrasts reliably, but they struggle with semantic restrictions on the distribution of quantifiers and negative polarity items and subtle syntactic phenomena such as extraction islands.
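A minimal-pair benchmark of this kind is typically scored by checking whether a model assigns higher probability to the acceptable sentence of each pair than to the unacceptable one. The sketch below illustrates that accuracy computation under stated assumptions: the add-one-smoothed unigram scorer, the toy corpus, and the names `train_unigram` and `minimal_pair_accuracy` are hypothetical stand-ins for illustration, not the paper's code or models.

```python
import math
from collections import Counter


def train_unigram(corpus):
    """Return a log-probability scorer from a toy unigram model (illustrative only)."""
    counts = Counter(tok for sent in corpus for tok in sent.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens (add-one smoothing)

    def log_prob(sentence):
        return sum(
            math.log((counts[tok] + 1) / (total + vocab))
            for tok in sentence.lower().split()
        )

    return log_prob


def minimal_pair_accuracy(pairs, log_prob):
    """Fraction of (acceptable, unacceptable) pairs where the acceptable one scores higher."""
    correct = sum(1 for good, bad in pairs if log_prob(good) > log_prob(bad))
    return correct / len(pairs)


# Toy data: each pair differs only in subject-verb agreement.
corpus = ["the cats sleep", "the cat sleeps", "the cats sleep"]
pairs = [
    ("the cats sleep", "the cats sleeps"),
    ("the cat sleeps", "the cat sleep"),
]
log_prob = train_unigram(corpus)
print(minimal_pair_accuracy(pairs, log_prob))
```

With this toy corpus the unigram scorer gets only the first pair right, because it prefers the more frequent verb form regardless of the subject. Any scorer that maps a sentence to a log-probability could be passed in its place.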
