CL · AI · Jun 5, 2023

KNOW How to Make Up Your Mind! Adversarially Detecting and Alleviating Inconsistencies in Natural Language Explanations

Myeongjun Jang, Bodhisattwa Prasad Majumder, Julian McAuley, Thomas Lukasiewicz, Oana-Maria Camburu

arXiv:2306.02980v1 · 26.4225 citations · h-index: 72 · Has Code

Originality: Incremental advance

AI Analysis

The paper addresses how reliable AI-generated explanations are for users, but its contribution is incremental: it builds on an existing adversarial attack and existing mitigation approaches.

The paper tackles the problem of detecting and alleviating inconsistencies in model-generated natural language explanations (NLEs). It shows that higher NLE quality does not necessarily mean fewer inconsistencies, and it proposes a mitigation method that reduces inconsistencies by grounding the model in external knowledge.

While recent works have been considerably improving the quality of the natural language explanations (NLEs) generated by a model to justify its predictions, there is very limited research in detecting and alleviating inconsistencies among generated NLEs. In this work, we leverage external knowledge bases to significantly improve on an existing adversarial attack for detecting inconsistent NLEs. We apply our attack to high-performing NLE models and show that models with higher NLE quality do not necessarily generate fewer inconsistencies. Moreover, we propose an off-the-shelf mitigation method to alleviate inconsistencies by grounding the model into external background knowledge. Our method decreases the inconsistencies of previous high-performing NLE models as detected by our attack.
