CL, AI · Oct 15, 2020

Reliable Evaluations for Natural Language Inference based on a Unified Cross-dataset Benchmark

arXiv:2010.07676v1 · 1 citation
Originality: Synthesis-oriented
AI Analysis

The work addresses the need for trustworthy evaluation settings in NLI research, so that biased datasets do not produce overestimated results. Its contribution is incremental, since it builds on existing debiasing methods and benchmarks.

The paper tackles unreliable evaluation in Natural Language Inference caused by dataset biases. It proposes a cross-dataset benchmark covering 14 datasets and re-evaluates 14 systems (9 widely used neural NLI models and 5 debiasing methods), showing that models often perform well on in-domain test sets but generalize poorly across datasets.

Recent studies show that crowd-sourced Natural Language Inference (NLI) datasets may suffer from significant biases like annotation artifacts. Models utilizing these superficial clues gain mirage advantages on the in-domain testing set, which makes the evaluation results over-estimated. The lack of trustworthy evaluation settings and benchmarks stalls the progress of NLI research. In this paper, we propose to assess a model's trustworthy generalization performance with cross-datasets evaluation. We present a new unified cross-datasets benchmark with 14 NLI datasets, and re-evaluate 9 widely-used neural network-based NLI models as well as 5 recently proposed debiasing methods for annotation artifacts. Our proposed evaluation scheme and experimental baselines could provide a basis to inspire future reliable NLI research.
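The idea of cross-dataset evaluation described in the abstract can be illustrated with a minimal sketch: train on one dataset, then score the same model on every dataset's test split. Everything below is an assumption made for illustration and is not the paper's benchmark, models, or code. The function names (`train_hypothesis_only`, `cross_dataset_matrix`), the toy "model" that only looks at hypothesis words, and the tiny hand-written datasets `A` and `B` are all hypothetical.

```python
from collections import Counter, defaultdict


def train_hypothesis_only(examples):
    """Toy 'model' (hypothetical): count which labels co-occur with each
    hypothesis word. It ignores the premise, so it can only exploit
    surface cues such as annotation artifacts."""
    word_labels = defaultdict(Counter)
    label_counts = Counter()
    for _premise, hypothesis, label in examples:
        label_counts[label] += 1
        for word in hypothesis.lower().split():
            word_labels[word][label] += 1
    default = label_counts.most_common(1)[0][0]

    def predict(_premise, hypothesis):
        votes = Counter()
        for word in hypothesis.lower().split():
            if word in word_labels:
                votes.update(word_labels[word])
        return votes.most_common(1)[0][0] if votes else default

    return predict


def accuracy(predict, examples):
    """Fraction of examples whose predicted label matches the gold label."""
    return sum(predict(p, h) == y for p, h, y in examples) / len(examples)


def cross_dataset_matrix(datasets, train_fn):
    """Train on each dataset, evaluate on every dataset's test split."""
    matrix = {}
    for source, splits in datasets.items():
        model = train_fn(splits["train"])
        matrix[source] = {
            target: accuracy(model, other["test"])
            for target, other in datasets.items()
        }
    return matrix


# Invented toy data. In dataset A, "not" always signals contradiction
# (an artifact); in dataset B it does not.
datasets = {
    "A": {
        "train": [
            ("a man stands in a park", "the man is not outside", "contradiction"),
            ("a dog runs on grass", "the dog is not running", "contradiction"),
            ("a woman walks in a field", "the woman is outdoors", "entailment"),
            ("a child kicks a ball", "the child is playing", "entailment"),
        ],
        "test": [
            ("a cat naps on a sofa", "the cat is not sleeping", "contradiction"),
            ("a girl kicks a ball", "the girl is playing", "entailment"),
        ],
    },
    "B": {
        "train": [
            ("a man sits in a kitchen", "the man is not outside", "entailment"),
            ("a dog lies in a yard", "the dog is not indoors", "contradiction"),
            ("a woman walks in a field", "the woman is outdoors", "entailment"),
            ("a child kicks a ball", "the child is not playing", "contradiction"),
        ],
        "test": [
            ("a man sits in a kitchen", "the man is not outside", "entailment"),
            ("a dog lies on a rug", "the dog is not running", "entailment"),
            ("a child kicks a ball", "the child is playing", "entailment"),
        ],
    },
}

matrix = cross_dataset_matrix(datasets, train_hypothesis_only)
for source, row in matrix.items():
    print(source, row)
```

In this toy setup the model trained on `A` scores perfectly on `A`'s own test split but worse on `B`, because the cue it learned does not transfer. Reading the off-diagonal entries of such a matrix is the kind of signal a cross-dataset benchmark is meant to expose.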

