CL, AI | Jun 2, 2021

COM2SENSE: A Commonsense Reasoning Benchmark with Complementary Sentences

arXiv:2106.00969v1721 citations | Has Code
Originality: Incremental advance
AI Analysis

COM2SENSE provides a more comprehensive benchmark for evaluating commonsense reasoning in AI and addresses limitations of existing datasets, though its contribution to assessment reliability is incremental.

The authors address the problem of reliably assessing commonsense reasoning in AI by introducing COM2SENSE, a benchmark dataset of 4k complementary sentence pairs. The strongest baseline achieved ~71% standard accuracy and ~51% pairwise accuracy, far below human performance (~95%).

Commonsense reasoning is intuitive for humans but has been a long-term challenge for artificial intelligence (AI). Recent advancements in pretrained language models have shown promising results on several commonsense benchmark datasets. However, the reliability and comprehensiveness of these benchmarks in assessing models' commonsense reasoning ability remain unclear. To this end, we introduce a new commonsense reasoning benchmark dataset comprising natural language true/false statements, with each sample paired with its complementary counterpart, resulting in 4k sentence pairs. We propose a pairwise accuracy metric to reliably measure an agent's ability to perform commonsense reasoning over a given situation. The dataset is crowdsourced and enhanced with an adversarial model-in-the-loop setup to incentivize challenging samples. To facilitate a systematic analysis of commonsense capabilities, we design our dataset along the dimensions of knowledge domains, reasoning scenarios, and numeracy. Experimental results demonstrate that our strongest baseline (UnifiedQA-3B), after fine-tuning, achieves ~71% standard accuracy and ~51% pairwise accuracy, well below human performance (~95% for both metrics). The dataset is available at https://github.com/PlusLabNLP/Com2Sense.
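
The gap between the ~71% standard accuracy and ~51% pairwise accuracy is easier to see with a small worked example. The sketch below assumes that pairwise accuracy credits a complementary pair only when both of its statements are classified correctly, while standard accuracy scores every statement independently; the abstract does not spell out this definition, so treat it as an assumption. The function names and the toy labels are hypothetical and are not taken from the Com2Sense repository.

```python
from typing import List, Tuple

# Each item holds the true/false values for the two statements of one
# complementary pair: (statement, complementary counterpart).
Pair = Tuple[bool, bool]


def standard_accuracy(labels: List[Pair], preds: List[Pair]) -> float:
    """Fraction of individual statements classified correctly."""
    correct = 0
    total = 0
    for (la, lb), (pa, pb) in zip(labels, preds):
        correct += int(la == pa) + int(lb == pb)
        total += 2
    return correct / total


def pairwise_accuracy(labels: List[Pair], preds: List[Pair]) -> float:
    """Fraction of pairs in which BOTH statements are classified correctly.

    Assumed definition: a pair earns credit only if the model is right
    on the statement and on its complementary counterpart.
    """
    correct = sum(
        int(la == pa and lb == pb)
        for (la, lb), (pa, pb) in zip(labels, preds)
    )
    return correct / len(labels)


# Toy data (hypothetical): four complementary pairs.
labels = [(True, False), (False, True), (True, False), (True, False)]
preds = [(True, False), (True, True), (True, True), (False, True)]

std_acc = standard_accuracy(labels, preds)
pair_acc = pairwise_accuracy(labels, preds)
print(f"standard accuracy: {std_acc:.2f}")
print(f"pairwise accuracy: {pair_acc:.2f}")
```

In this toy data, a model that outputs "true" for both statements of a pair still gets half of those statements right, but earns no pairwise credit. Under the assumed definition, pairwise accuracy can therefore never exceed standard accuracy.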

Code Implementations: 1 repo
