CLAIJun 5, 2023

KNOW How to Make Up Your Mind! Adversarially Detecting and Alleviating Inconsistencies in Natural Language Explanations

arXiv:2306.02980v1225 citationsh-index: 72Has Code
Originality Incremental advance
AI Analysis

This addresses the reliability of AI explanations for users, but it is incremental as it builds on existing adversarial attacks and mitigation approaches.

The paper tackled the problem of detecting and alleviating inconsistencies in natural language explanations (NLEs) generated by models, showing that higher NLE quality does not correlate with fewer inconsistencies and proposing a mitigation method that reduces inconsistencies by grounding models in external knowledge.

While recent works have been considerably improving the quality of the natural language explanations (NLEs) generated by a model to justify its predictions, there is very limited research in detecting and alleviating inconsistencies among generated NLEs. In this work, we leverage external knowledge bases to significantly improve on an existing adversarial attack for detecting inconsistent NLEs. We apply our attack to high-performing NLE models and show that models with higher NLE quality do not necessarily generate fewer inconsistencies. Moreover, we propose an off-the-shelf mitigation method to alleviate inconsistencies by grounding the model into external background knowledge. Our method decreases the inconsistencies of previous high-performing NLE models as detected by our attack.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes