Unsupervised Evaluation for Question Answering with Transformers
This provides an unsupervised evaluation method for QA models, which is incremental but useful for dataset development and model assessment.
The paper tackles the problem of automatically evaluating question answering model predictions at inference time without labeled data, achieving 91.37% accuracy on SQuAD and 80.7% on SubjQA by analyzing hidden representations in transformers.
It is challenging to automatically evaluate the answer of a QA model at inference time. Although many models provide confidence scores, and simple heuristics can go a long way towards indicating answer correctness, such measures are heavily dataset-dependent and are unlikely to generalize. In this work, we begin by investigating the hidden representations of questions, answers, and contexts in transformer-based QA architectures. We observe a consistent pattern in the answer representations, which we show can be used to automatically evaluate whether or not a predicted answer span is correct. Our method does not require any labeled data and outperforms strong heuristic baselines, across 2 datasets and 7 domains. We are able to predict whether or not a model's answer is correct with 91.37% accuracy on SQuAD, and 80.7% accuracy on SubjQA. We expect that this method will have broad applications, e.g., in the semi-automatic development of QA datasets