CL · Jun 2, 2018

Stress Test Evaluation for Natural Language Inference

arXiv:1806.00692v3 · 1268 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for better evaluation methods in natural language understanding. Its contribution is incremental: it tests existing models rather than introducing new ones.

To evaluate whether natural language inference models truly understand semantics, the authors propose automatically constructed stress tests. Applied to six sentence-encoder models, these tests reveal the models' strengths and weaknesses.

Natural language inference (NLI) is the task of determining if a natural language hypothesis can be inferred from a given premise in a justifiable manner. NLI was proposed as a benchmark task for natural language understanding. Existing models perform well on standard datasets for NLI, achieving impressive results across different genres of text. However, the extent to which these models understand the semantic content of sentences is unclear. In this work, we propose an evaluation methodology consisting of automatically constructed "stress tests" that allow us to examine whether systems have the ability to make real inferential decisions. Our evaluation of six sentence-encoder models on these stress tests reveals strengths and weaknesses of these models with respect to challenging linguistic phenomena, and suggests important directions for future work in this area.
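As a rough illustration of what an automatically constructed stress test might look like, the sketch below perturbs each hypothesis with a tautology, which should leave the gold label unchanged, and then compares a model's accuracy before and after. The abstract does not specify the construction, so the tautology phrase, the `NLIExample` type, and the helper names (`append_tautology`, `accuracy`, `stress_gap`) are all assumptions made for this example, not the authors' implementation.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class NLIExample:
    premise: str
    hypothesis: str
    label: str  # "entailment", "neutral", or "contradiction"


# Assumed perturbation phrase: a tautology adds no information,
# so the gold label of the pair should not change.
TAUTOLOGY = "and true is true"

# A predictor maps (premise, hypothesis) to a label string.
Predictor = Callable[[str, str], str]


def append_tautology(example: NLIExample) -> NLIExample:
    """Return a copy of the example with the tautology appended to the hypothesis."""
    hypothesis = example.hypothesis.rstrip()
    # Drop a trailing period so the tautology joins the same sentence.
    if hypothesis.endswith("."):
        hypothesis = hypothesis[:-1].rstrip()
    return replace(example, hypothesis=f"{hypothesis} {TAUTOLOGY}.")


def accuracy(predict: Predictor, examples: Iterable[NLIExample]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    examples = list(examples)
    if not examples:
        return 0.0
    correct = sum(predict(ex.premise, ex.hypothesis) == ex.label for ex in examples)
    return correct / len(examples)


def stress_gap(
    predict: Predictor,
    examples: Iterable[NLIExample],
    perturb: Callable[[NLIExample], NLIExample],
) -> Tuple[float, float]:
    """Return (accuracy on original examples, accuracy on perturbed examples)."""
    examples = list(examples)
    original = accuracy(predict, examples)
    stressed = accuracy(predict, [perturb(ex) for ex in examples])
    return original, stressed
```

A large drop from the first value to the second would suggest the model relies on surface cues, such as word overlap or sentence length, rather than on the meaning of the pair.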

Code Implementations: 1 repo
