CL · Oct 13, 2020

F1 is Not Enough! Models and Evaluation Towards User-Centered Explainable Question Answering

arXiv:2010.06283v1998 citations
Originality: Incremental advance
AI Analysis

This work addresses the user experience gap in explainable AI for question answering, offering incremental improvements in model design and evaluation for better real-world applicability.

The paper tackles weak coupling between answers and explanations in explainable question answering systems, which harms user experience. The proposed hierarchical model and regularization term improve users' ability to judge system correctness, and the user study shows that the two new evaluation scores align better with practical usefulness than traditional metrics such as F1.

Explainable question answering systems predict an answer together with an explanation showing why the answer has been selected. The goal is to enable users to assess the correctness of the system and understand its reasoning process. However, we show that current models and evaluation settings have shortcomings regarding the coupling of answer and explanation which might cause serious issues in user experience. As a remedy, we propose a hierarchical model and a new regularization term to strengthen the answer-explanation coupling as well as two evaluation scores to quantify the coupling. We conduct experiments on the HOTPOTQA benchmark data set and perform a user study. The user study shows that our models increase the ability of the users to judge the correctness of the system and that scores like F1 are not enough to estimate the usefulness of a model in a practical setting with human users. Our scores are better aligned with user experience, making them promising candidates for model selection.
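
To make the gap between answer quality and answer-explanation coupling concrete, the following is a minimal sketch in Python. `token_f1` is the usual token-overlap F1 used for answer strings in extractive QA evaluation, simplified here to lowercase whitespace tokenization. `answer_in_explanation` is a hypothetical check introduced only for illustration; it is not one of the paper's two scores, which this summary does not define.

```python
from collections import Counter
from typing import List


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # occurs in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def answer_in_explanation(answer: str, explanation_sentences: List[str]) -> bool:
    """Hypothetical coupling check (an assumption, not the paper's score):
    does the answer string occur in any predicted explanation sentence?"""
    needle = answer.lower().strip()
    return bool(needle) and any(needle in s.lower() for s in explanation_sentences)


# A prediction can be a perfect answer by F1 while its explanation never
# mentions that answer.
f1 = token_f1("Paris", "Paris")
coupled = answer_in_explanation("Paris", ["France is a country in Europe."])
print(f1, coupled)
```

In this toy case the answer earns full F1 credit even though the explanation gives the user nothing to verify it with, which is the kind of mismatch the abstract argues F1 alone cannot reveal.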

Code Implementations: 1 repo