CL · LG · Apr 8, 2019

CODAH: An Adversarially Authored Question-Answer Dataset for Common Sense

arXiv:1904.04365v4 · 14 citations
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of constructing difficult datasets for testing commonsense reasoning in AI. The advance is incremental, since it builds on the SWAG dataset.

The authors tackle the problem of evaluating commonsense reasoning in AI by creating CODAH, an adversarially authored question-answer dataset. On it, human accuracy is 95.3% while the best model reaches 67.5%, a gap of 27.8 percentage points.

Commonsense reasoning is a critical AI capability, but it is difficult to construct challenging datasets that test common sense. Recent neural question answering systems, based on large pre-trained language models, have already achieved near-human-level performance on commonsense knowledge benchmarks. These systems do not possess human-level common sense, but are able to exploit limitations of the datasets to achieve human-level scores. We introduce the CODAH dataset, an adversarially constructed evaluation dataset for testing common sense. CODAH forms a challenging extension to the recently proposed SWAG dataset, which tests commonsense knowledge using sentence-completion questions that describe situations observed in video. To produce a more difficult dataset, we introduce a novel procedure for question acquisition in which workers author questions designed to target weaknesses of state-of-the-art neural question answering systems. Workers are rewarded for submissions that models fail to answer correctly both before and after fine-tuning (in cross-validation). We create 2.8k questions via this procedure and evaluate the performance of multiple state-of-the-art question answering systems on our dataset. We observe a significant gap between human performance, at 95.3%, and the best baseline accuracy of 67.5%, achieved by the BERT-Large model.
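The acquisition rule in the abstract (a submission is rewarded when the model answers it incorrectly both before and after fine-tuning) can be illustrated with a minimal sketch. This is not the authors' code: the `Question` type, the `fools_both` and `accuracy` helpers, the two stand-in "models", and the toy questions are all hypothetical names and data invented for illustration. The cross-validation step is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Question:
    """A sentence-completion item (hypothetical structure, not the CODAH file format)."""
    prompt: str
    choices: Tuple[str, ...]
    answer_index: int


# A "model" here is any callable that returns the index of its chosen completion.
Model = Callable[[Question], int]


def accuracy(model: Model, questions: Sequence[Question]) -> float:
    """Fraction of questions for which the model picks the labeled answer."""
    if not questions:
        raise ValueError("need at least one question")
    correct = sum(model(q) == q.answer_index for q in questions)
    return correct / len(questions)


def fools_both(question: Question, pretrained: Model, fine_tuned: Model) -> bool:
    """True when both model variants answer incorrectly (the rewarded case)."""
    return (pretrained(question) != question.answer_index
            and fine_tuned(question) != question.answer_index)


# Stand-in models for illustration only; real systems would score each choice.
def always_first(q: Question) -> int:
    return 0


def always_last(q: Question) -> int:
    return len(q.choices) - 1


# Toy questions invented for this sketch; they are not taken from the dataset.
questions = [
    Question("A man opens an umbrella because",
             ("it is raining.", "he is hungry.", "he is asleep."), 0),
    Question("She put the kettle on to",
             ("paint the wall.", "fix the car.", "make tea."), 2),
]

toy_accuracy = accuracy(always_first, questions)
rewarded = [fools_both(q, always_first, always_first) for q in questions]
```

In this toy setup, only the second question would count as fooling the model, since `always_first` happens to answer the first one correctly. The same `accuracy` helper is the kind of metric behind the reported 95.3% and 67.5% figures, assuming plain multiple-choice accuracy.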

Code Implementations: 2 repos