CL · Jul 17, 2019

Probing Neural Network Comprehension of Natural Language Arguments

arXiv:1907.07355v2 · 1313 citations
AI Analysis

The result exposes a critical flaw in existing NLP benchmarks for argument comprehension and points to the need for more robust evaluation methods.

The paper found that BERT's 77% peak performance on the Argument Reasoning Comprehension Task, three points below the average untrained human baseline, is entirely accounted for by spurious statistical cues in the dataset. The authors then built an adversarial dataset on which all models achieve random accuracy.

We are surprised to find that BERT's peak performance of 77% on the Argument Reasoning Comprehension Task reaches just three points below the average untrained human baseline. However, we show that this result is entirely accounted for by exploitation of spurious statistical cues in the dataset. We analyze the nature of these cues and demonstrate that a range of models all exploit them. This analysis informs the construction of an adversarial dataset on which all models achieve random accuracy. Our adversarial dataset provides a more robust assessment of argument comprehension and should be adopted as the standard in future work.
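To make the idea of a spurious statistical cue concrete, here is a minimal sketch on invented toy data. It assumes a two-choice setup in which each item offers two candidate warrants and a label marking the correct one. The data, the `cue_stats` and `mirror` helpers, and the "productivity" measure as defined here are illustrative assumptions, not the paper's code or dataset. The sketch measures how often a single token (here "not") sits in the correct candidate, then shows that adding a label-flipped mirror of every item, one possible way to balance such a cue, pushes that figure to chance.

```python
# Toy illustration (hypothetical data): how a single token can predict the label.
examples = [
    {"warrants": ["a is not b", "a is b"], "label": 0},
    {"warrants": ["c is d", "c is not d"], "label": 1},
    {"warrants": ["e is not f", "e is f"], "label": 0},
    {"warrants": ["g is h", "g is not h"], "label": 0},
]


def cue_stats(items, cue):
    """Return (applicable, productivity) for a token cue.

    applicable:   number of items where the cue appears in exactly one candidate.
    productivity: share of those items where the cue is in the correct candidate.
    """
    applicable = 0
    correct = 0
    for item in items:
        present = [cue in warrant.split() for warrant in item["warrants"]]
        if present[0] != present[1]:
            applicable += 1
            if present[item["label"]]:
                correct += 1
    productivity = correct / applicable if applicable else 0.0
    return applicable, productivity


def mirror(items):
    """Assumed balancing step: add a copy of each item with the label flipped
    (standing in for a negated claim), so every warrant is right once and wrong once."""
    flipped = [{**item, "claim_negated": True, "label": 1 - item["label"]} for item in items]
    return items + flipped


original = cue_stats(examples, "not")
balanced = cue_stats(mirror(examples), "not")
print(original)  # cue is right on 3 of 4 applicable items
print(balanced)  # cue is right on 4 of 8 applicable items
```

On the toy data, a classifier that only looks for "not" would be right three times out of four without reading the argument at all; after mirroring, the same shortcut is no better than a coin flip.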
