CL · AI · Nov 20, 2020

What do we expect from Multiple-choice QA Systems?

arXiv:2011.10647v11004 citations
AI Analysis

This work highlights a gap between model performance and human expectations for language understanding in MCQA systems, which is important for researchers developing more robust and interpretable NLP models.

This paper investigates a top-performing Multiple Choice Question Answering (MCQA) model using zero-information perturbations and finds that it fails to meet human expectations for language understanding. The authors propose a modified training approach that forces the model to better attend to inputs, resulting in a new model that performs comparably to the original while better satisfying expectations.

The recent success of machine learning systems on various QA datasets could be interpreted as a significant improvement in models' language understanding abilities. However, using various perturbations, multiple recent works have shown that good performance on a dataset does not necessarily correlate with what humans expect from models that "understand" language. In this work we consider a top-performing model on several Multiple Choice Question Answering (MCQA) datasets, and evaluate it against a set of expectations one might have from such a model, using a series of zero-information perturbations of the model's inputs. Our results show that the model clearly falls short of our expectations, and they motivate a modified training approach that forces the model to better attend to the inputs. We show that the new training paradigm leads to a model that performs on par with the original model while better satisfying our expectations.
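The abstract does not list the specific perturbations, so the following is only a minimal sketch of the general idea of a perturbation check, under stated assumptions. It uses option shuffling as an illustrative perturbation that adds no information to the input; this choice, the `Model` interface, and the names `shuffle_options`, `consistency`, and `first_option_model` are all hypothetical and are not taken from the paper.

```python
import random
from typing import Callable, List, Sequence, Tuple

# One MCQA example: (question, options, index of the correct option).
Example = Tuple[str, List[str], int]
# Hypothetical model interface: returns the index of the chosen option.
Model = Callable[[str, Sequence[str]], int]


def shuffle_options(example: Example, rng: random.Random) -> Example:
    """Reorder the options; their content and the correct answer are unchanged."""
    question, options, answer = example
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # The correct option now sits wherever its old index landed in `order`.
    return question, shuffled, order.index(answer)


def consistency(model: Model, data: Sequence[Example], seed: int = 0) -> float:
    """Fraction of examples where the model picks the same option text
    before and after the perturbation."""
    rng = random.Random(seed)
    same = 0
    for example in data:
        question, options, _ = example
        p_question, p_options, _ = shuffle_options(example, rng)
        before = options[model(question, options)]
        after = p_options[model(p_question, p_options)]
        same += before == after
    return same / len(data)


# Toy stand-in (an assumption, not a real MCQA model): always picks the
# first option, so its choice depends on position rather than content.
def first_option_model(question: str, options: Sequence[str]) -> int:
    return 0
```

A model that actually reads the options should score close to 1.0 on `consistency`, while a position-biased model such as `first_option_model` generally will not. The same harness could be reused with other perturbation functions in place of `shuffle_options`.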
