CL · Feb 4, 2019

Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference

R. Thomas McCoy, Ellie Pavlick, Tal Linzen

arXiv:1902.01007v4 · 37.0 · 1574 citations · Has Code

Originality: Incremental advance

AI Analysis

This paper addresses overfitting to superficial patterns in NLI, a concern for both researchers and practitioners, though its contribution is incremental because it builds on existing evaluation methods.

The paper tackles the problem of natural language inference (NLI) models relying on fallible syntactic heuristics, such as lexical overlap, by introducing the HANS dataset for controlled evaluation. It finds that models trained on MNLI, including BERT, perform very poorly on HANS, suggesting that they have adopted these heuristics.

A machine learning system can score well on a given test set by relying on heuristics that are effective for frequent example types but break down in more challenging cases. We study this issue within natural language inference (NLI), the task of determining whether one sentence entails another. We hypothesize that statistical NLI models may adopt three fallible syntactic heuristics: the lexical overlap heuristic, the subsequence heuristic, and the constituent heuristic. To determine whether models have adopted these heuristics, we introduce a controlled evaluation set called HANS (Heuristic Analysis for NLI Systems), which contains many examples where the heuristics fail. We find that models trained on MNLI, including BERT, a state-of-the-art model, perform very poorly on HANS, suggesting that they have indeed adopted these heuristics. We conclude that there is substantial room for improvement in NLI systems, and that the HANS dataset can motivate and measure progress in this area.
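
To make the first two heuristics concrete, here is a minimal sketch of each as a rule-based predictor. It is an illustration only, not the authors' code or the HANS tooling: the function names and the whitespace tokenizer are assumptions made for this example. The lexical overlap heuristic predicts entailment whenever every hypothesis word appears in the premise; the subsequence heuristic predicts entailment whenever the hypothesis is a contiguous word sequence of the premise. The constituent heuristic is omitted because it needs a parse tree. The two sentence pairs are in the style of HANS examples, and in both the correct label is non-entailment.

```python
import string


def tokenize(sentence):
    """Lowercase, split on whitespace, and strip surrounding punctuation.

    A deliberately simple tokenizer assumed for this sketch.
    """
    words = (w.strip(string.punctuation).lower() for w in sentence.split())
    return [w for w in words if w]


def lexical_overlap_heuristic(premise, hypothesis):
    """Predict entailment if every hypothesis word occurs in the premise."""
    premise_words = set(tokenize(premise))
    hypothesis_words = tokenize(hypothesis)
    return bool(hypothesis_words) and all(
        w in premise_words for w in hypothesis_words
    )


def subsequence_heuristic(premise, hypothesis):
    """Predict entailment if the hypothesis is a contiguous run of premise words."""
    p, h = tokenize(premise), tokenize(hypothesis)
    n = len(h)
    return n > 0 and any(p[i:i + n] == h for i in range(len(p) - n + 1))


# Both pairs are non-entailment, yet a heuristic fires on each of them.
passive = ("The doctor was paid by the actor.", "The doctor paid the actor.")
modifier = ("The doctor near the actor danced.", "The actor danced.")

for premise, hypothesis in (passive, modifier):
    print(premise, "->", hypothesis)
    print("  lexical overlap predicts entailment:",
          lexical_overlap_heuristic(premise, hypothesis))
    print("  subsequence predicts entailment:",
          subsequence_heuristic(premise, hypothesis))
```

On the first pair only the lexical overlap rule fires, because the hypothesis words are all in the premise but not contiguous. On the second pair both rules fire. A model that has absorbed either shortcut would therefore label these pairs as entailment, which is the kind of failure HANS is built to expose.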
