CL · Oct 13, 2020

F1 is Not Enough! Models and Evaluation Towards User-Centered Explainable Question Answering

Hendrik Schuff, Heike Adel, Ngoc Thang Vu

arXiv:2010.06283v131.1998 citations · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the user experience gap in explainable AI for question answering, offering incremental improvements in model design and evaluation for better real-world applicability.

The paper tackles the weak coupling between answers and explanations in explainable question answering systems, which harms user experience. The proposed hierarchical model and regularization term strengthen this coupling and improve users' ability to judge system correctness. The user study shows that the two new evaluation scores align better with practical usefulness than traditional metrics such as F1.

Explainable question answering systems predict an answer together with an explanation showing why the answer has been selected. The goal is to enable users to assess the correctness of the system and understand its reasoning process. However, we show that current models and evaluation settings have shortcomings regarding the coupling of answer and explanation which might cause serious issues in user experience. As a remedy, we propose a hierarchical model and a new regularization term to strengthen the answer-explanation coupling as well as two evaluation scores to quantify the coupling. We conduct experiments on the HOTPOTQA benchmark data set and perform a user study. The user study shows that our models increase the ability of the users to judge the correctness of the system and that scores like F1 are not enough to estimate the usefulness of a model in a practical setting with human users. Our scores are better aligned with user experience, making them promising candidates for model selection.
