CL AI Nov 20, 2020

What do we expect from Multiple-choice QA Systems?

arXiv:2011.10647v1 31.11004 citations

Originality: Incremental advance

AI Analysis

This work highlights a gap between model performance and human expectations for language understanding in MCQA systems, which is important for researchers developing more robust and interpretable NLP models.

This paper investigates a top-performing Multiple Choice Question Answering (MCQA) model using zero-information perturbations and finds that it fails to meet human expectations for language understanding. The authors propose a modified training approach that forces the model to better attend to inputs, resulting in a new model that performs comparably to the original while better satisfying expectations.

The recent success of machine learning systems on various QA datasets could be interpreted as a significant improvement in models' language understanding abilities. However, using various perturbations, multiple recent works have shown that good performance on a dataset might not correlate well with human expectations of models that "understand" language. In this work we consider a top-performing model on several Multiple Choice Question Answering (MCQA) datasets and evaluate it against a set of expectations one might have of such a model, using a series of zero-information perturbations of the model's inputs. Our results show that the model clearly falls short of our expectations, and they motivate a modified training approach that forces the model to better attend to the inputs. We show that the new training paradigm leads to a model that performs on par with the original model while better satisfying our expectations.
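To make the evaluation idea in the abstract concrete, here is a minimal Python sketch of a perturbation check. Everything in it is an assumption made for illustration and is not taken from the paper: the `MCQAExample` layout, the `drop_question` and `drop_passage` perturbations (hypothetical stand-ins for inputs that no longer carry the information needed to answer), the `unchanged_rate` metric, and the toy `longest_option_model`. The authors' actual perturbations, model, and expectations are not specified in the abstract.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class MCQAExample:
    """Hypothetical container for one multiple-choice item."""
    passage: str
    question: str
    options: List[str]


# A "model" here is any callable that returns the index of the chosen option.
Model = Callable[[MCQAExample], int]
Perturbation = Callable[[MCQAExample], MCQAExample]


def drop_question(ex: MCQAExample) -> MCQAExample:
    """Illustrative perturbation: blank out the question text."""
    return replace(ex, question="")


def drop_passage(ex: MCQAExample) -> MCQAExample:
    """Illustrative perturbation: blank out the passage text."""
    return replace(ex, passage="")


def unchanged_rate(model: Model,
                   examples: List[MCQAExample],
                   perturb: Perturbation) -> float:
    """Fraction of examples whose predicted option is the same
    before and after the perturbation is applied."""
    if not examples:
        raise ValueError("examples must be non-empty")
    same = sum(model(ex) == model(perturb(ex)) for ex in examples)
    return same / len(examples)


def longest_option_model(ex: MCQAExample) -> int:
    """Toy model that ignores passage and question entirely and
    always picks the longest option (first one on ties)."""
    return max(range(len(ex.options)), key=lambda i: len(ex.options[i]))
```

Under these assumptions, a model whose predictions barely change when the question or passage is removed is probably not relying on those inputs. The toy model above is the extreme case: it never reads them, so its predictions are identical on every perturbed example.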
