Adversarial NLI: A New Benchmark for Natural Language Understanding
This work addresses the need for more robust and dynamic evaluation in natural language understanding. Its contribution is incremental, however, because it builds on existing NLI benchmarks.
The authors tackle the problem of evaluating natural language understanding by creating a new adversarial NLI benchmark dataset. Training models on this dataset leads to state-of-the-art performance on existing NLI benchmarks, while its harder test set and the accompanying analysis expose weaknesses in current models.
We introduce a new large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure. We show that training models on this new dataset leads to state-of-the-art performance on a variety of popular NLI benchmarks, while posing a more difficult challenge with its new test set. Our analysis sheds light on the shortcomings of current state-of-the-art models, and shows that non-expert annotators are successful at finding their weaknesses. The data collection method can be applied in a never-ending learning scenario, becoming a moving target for NLU, rather than a static benchmark that will quickly saturate.
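The iterative, adversarial human-and-model-in-the-loop procedure can be pictured with a minimal sketch. The abstract does not spell out the mechanics, so everything here is an assumption made for illustration: the names `collect_round` and `adversarial_collection`, the callbacks `train` and `propose` (standing in for model retraining and for human annotators writing examples), and the rule that an example is kept only when the current model mislabels it.

```python
from typing import Callable, List, Tuple

# An example is (context, hypothesis, gold_label). Hypothetical representation.
Example = Tuple[str, str, str]
# A model maps (context, hypothesis) to a predicted label.
Model = Callable[[str, str], str]


def collect_round(model: Model, proposals: List[Example]) -> List[Example]:
    """Keep only the annotator proposals that the current model gets wrong."""
    return [ex for ex in proposals if model(ex[0], ex[1]) != ex[2]]


def adversarial_collection(
    initial_model: Model,
    train: Callable[[List[Example]], Model],
    propose: Callable[[int], List[Example]],
    num_rounds: int,
) -> List[List[Example]]:
    """Run several rounds; each round targets a model trained on earlier rounds."""
    model = initial_model
    collected: List[Example] = []
    rounds: List[List[Example]] = []
    for round_index in range(num_rounds):
        # Annotators (simulated by `propose`) write examples for this round.
        fooled = collect_round(model, propose(round_index))
        rounds.append(fooled)
        collected.extend(fooled)
        # Retrain on everything collected so far, so the target keeps moving.
        if collected:
            model = train(collected)
    return rounds
```

Because each round is filtered against a model trained on the earlier rounds, examples that fooled an earlier model stop counting once the model has learned them. This is one way to read the "moving target" framing in the abstract. A real pipeline would also need human verification of the labels, which this sketch omits.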