CL · AI · May 5, 2021

Do Natural Language Explanations Represent Valid Logical Arguments? Verifying Entailment in Explainable NLI Gold Standards

arXiv:2105.01974v2662 citations
Originality: Incremental advance
AI Analysis

This work addresses a critical quality issue in Explainable NLP datasets, whose human-annotated explanations serve as ground truth for building and evaluating models. It shows that current explanation gold standards are flawed and that further work is needed on how they are constructed.

The paper tackles the problem of logical validity in human-annotated explanations for Natural Language Inference (NLI) by proposing the Explanation Entailment Verification (EEV) methodology. Applied to three mainstream datasets, EEV reveals that a majority of the explanations are logically invalid, with issues ranging from incompleteness to clearly identifiable logical errors.

An emerging line of research in Explainable NLP is the creation of datasets enriched with human-annotated explanations and rationales, used to build and evaluate models with step-wise inference and explanation generation capabilities. While human-annotated explanations are used as ground-truth for the inference, there is a lack of systematic assessment of their consistency and rigour. In an attempt to provide a critical quality assessment of Explanation Gold Standards (XGSs) for NLI, we propose a systematic annotation methodology, named Explanation Entailment Verification (EEV), to quantify the logical validity of human-annotated explanations. The application of EEV on three mainstream datasets reveals the surprising conclusion that a majority of the explanations, while appearing coherent on the surface, represent logically invalid arguments, ranging from being incomplete to containing clearly identifiable logical errors. This conclusion confirms that the inferential properties of explanations are still poorly formalised and understood, and that additional work on this line of research is necessary to improve the way Explanation Gold Standards are constructed.

