CL · Sep 16, 2019

Probing Natural Language Inference Models through Semantic Fragments

Kyle Richardson, Hai Hu, Lawrence S. Moss, Ashish Sabharwal

arXiv:1909.07521v2 · 10.6151 citations

Originality: Incremental advance

AI Analysis

This work addresses the need for more precise evaluation of linguistic reasoning in AI models, though it is incremental as it builds on existing probing and fine-tuning methods.

The study asks whether state-of-the-art natural language inference models capture semantic phenomena such as boolean coordination and quantification. Models such as BERT performed poorly on the new semantic fragment datasets, but a BERT-based model could master them with a few minutes of additional fine-tuning while retaining its performance on established benchmarks.

Do state-of-the-art models for language understanding already have, or can they easily learn, abilities such as boolean coordination, quantification, conditionals, comparatives, and monotonicity reasoning (i.e., reasoning about word substitutions in sentential contexts)? While such phenomena are involved in natural language inference (NLI) and go beyond basic linguistic understanding, it is unclear to what extent they are captured in existing NLI benchmarks and effectively learned by models. To investigate this, we propose the use of semantic fragments (systematically generated datasets that each target a different semantic phenomenon) for probing, and efficiently improving, such capabilities of linguistic models. This approach to creating challenge datasets allows direct control over the semantic diversity and complexity of the targeted linguistic phenomena, and results in a more precise characterization of a model's linguistic behavior. Our experiments, using a library of 8 such semantic fragments, reveal two remarkable findings: (a) State-of-the-art models, including BERT, that are pre-trained on existing NLI benchmark datasets perform poorly on these new fragments, even though the phenomena probed here are central to the NLI task. (b) On the other hand, with only a few minutes of additional fine-tuning (with a carefully selected learning rate and a novel variation of "inoculation"), a BERT-based model can master all of these logic and monotonicity fragments while retaining its performance on established NLI benchmarks.
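To make the idea of a systematically generated fragment concrete, here is a minimal sketch of a toy boolean-coordination generator. It is not the paper's grammar or data: the name and place lists, the single sentence template, and the functions `make_example` and `make_fragment` are all illustrative assumptions. It shows only how a fixed template can yield premise/hypothesis pairs whose labels are known by construction.

```python
import random

# Hypothetical vocabulary; not taken from the paper.
NAMES = ["Alice", "Bob", "Carol", "Dave", "Erin"]
PLACES = ["Peru", "Kenya", "Norway"]
LABELS = ["ENTAILMENT", "CONTRADICTION", "NEUTRAL"]


def make_example(rng):
    """Build one (premise, hypothesis, label) triple from a fixed template."""
    place = rng.choice(PLACES)
    # Three people are coordinated in the premise; the rest go unmentioned.
    mentioned = rng.sample(NAMES, 3)
    unmentioned = [n for n in NAMES if n not in mentioned]
    premise = (
        f"{mentioned[0]}, {mentioned[1]}, and {mentioned[2]} "
        f"have visited {place}."
    )
    label = rng.choice(LABELS)
    if label == "ENTAILMENT":
        # Any single conjunct follows from the coordinated premise.
        hypothesis = f"{rng.choice(mentioned)} has visited {place}."
    elif label == "CONTRADICTION":
        # Negating a conjunct contradicts the premise.
        hypothesis = f"{rng.choice(mentioned)} has not visited {place}."
    else:
        # The premise says nothing about people it does not mention.
        hypothesis = f"{rng.choice(unmentioned)} has visited {place}."
    return premise, hypothesis, label


def make_fragment(size, seed=0):
    """Generate a reproducible list of labeled examples."""
    rng = random.Random(seed)
    return [make_example(rng) for _ in range(size)]


if __name__ == "__main__":
    for premise, hypothesis, label in make_fragment(3):
        print(f"{label}: {premise} => {hypothesis}")
```

Because the label is fixed by the template rather than by annotation, the complexity of such a toy fragment (number of conjuncts, vocabulary size) can be varied directly, which is the kind of control the abstract describes.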
