CL · Aug 21, 2015

A large annotated corpus for learning natural language inference

arXiv:1508.05326v1 · 4695 citations
Originality: Incremental advance
AI Analysis

This paper addresses a bottleneck for machine learning research in natural language processing by providing a foundational dataset, though the advance is incremental because it builds on existing tasks.

The authors tackled the lack of large-scale resources for natural language inference by introducing the Stanford Natural Language Inference corpus, a dataset of 570K labeled sentence pairs. Its scale enabled lexicalized classifiers to outperform some existing entailment models and allowed a neural network-based model to perform competitively on natural language inference benchmarks.

Understanding entailment and contradiction is fundamental to understanding natural language, and inference about entailment and contradiction is a valuable testing ground for the development of semantic representations. However, machine learning research in this area has been dramatically limited by the lack of large-scale resources. To address this, we introduce the Stanford Natural Language Inference corpus, a new, freely available collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning. At 570K pairs, it is two orders of magnitude larger than all other resources of its type. This increase in scale allows lexicalized classifiers to outperform some sophisticated existing entailment models, and it allows a neural network-based model to perform competitively on natural language inference benchmarks for the first time.

Code Implementations: 3 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
