LG · AI · CL · ML · Nov 5, 2018

How Reasonable are Common-Sense Reasoning Tasks: A Case-Study on the Winograd Schema Challenge and SWAG

arXiv:1811.01778v2 · 1038 citations
AI Analysis

This work addresses the reliability of common-sense reasoning benchmarks for researchers and practitioners, highlighting potential issues in evaluating AI systems.

The paper investigates whether improved performance on common-sense reasoning benchmarks such as the Winograd Schema Challenge (WSC) and SWAG reflects genuine progress. It analyzes threats to the validity of earlier experimental designs and proposes evaluation protocols that account for benchmark properties such as size limitations, structural regularities, and variable instance difficulty.

Recent studies have significantly improved the state-of-the-art on common-sense reasoning (CSR) benchmarks like the Winograd Schema Challenge (WSC) and SWAG. The question we ask in this paper is whether improved performance on these benchmarks represents genuine progress towards common-sense-enabled systems. We make case studies of both benchmarks and design protocols that clarify and qualify the results of previous work by analyzing threats to the validity of previous experimental designs. Our protocols account for several properties prevalent in common-sense benchmarks including size limitations, structural regularities, and variable instance difficulty.
