CL | Oct 8, 2020

Precise Task Formalization Matters in Winograd Schema Evaluations

arXiv:2010.04043v1997 citations
AI Analysis

The finding matters to NLP benchmark creators and users: reported gains may be modest and driven by methodological choices rather than by advances in the models themselves.

The paper investigates how task formalization affects performance on the Winograd Schema Challenge. It finds that framing the task as multiple choice improves accuracy by 2-6 points, a gain that stems from the formalization rather than from better reasoning ability.

Performance on the Winograd Schema Challenge (WSC), a respected English commonsense reasoning benchmark, recently rocketed from chance accuracy to 89% on the SuperGLUE leaderboard, with relatively little corroborating evidence of a correspondingly large improvement in reasoning ability. We hypothesize that much of this improvement comes from recent changes in task formalization (the combination of input specification, loss function, and reuse of pretrained parameters) by users of the dataset, rather than improvements in the pretrained model's reasoning ability. We perform an ablation on two Winograd Schema datasets that interpolates between the formalizations used before and after this surge, and find (i) framing the task as multiple choice improves performance by 2-6 points and (ii) several additional techniques, including the reuse of a pretrained language modeling head, can mitigate the model's extreme sensitivity to hyperparameters. We urge future benchmark creators to impose additional structure to minimize the impact of formalization decisions on reported results.
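To make the notion of a "loss function" choice concrete, here is a minimal Python sketch that contrasts two ways of scoring candidate referents for a single Winograd example. It is an illustration under stated assumptions, not the paper's implementation: the function names `pointwise_loss` and `multiple_choice_loss`, and the toy scores, are hypothetical. The pointwise variant treats each candidate as an independent yes/no decision, while the multiple-choice variant normalizes over all candidates of the same example.

```python
import math


def pointwise_loss(scores, gold_index):
    """Binary cross-entropy applied to each candidate independently.

    Hypothetical helper: each score is a logit for "this candidate is the
    referent", and candidates do not compete with one another.
    """
    total = 0.0
    for i, s in enumerate(scores):
        p = 1.0 / (1.0 + math.exp(-s))  # sigmoid
        total += -math.log(p) if i == gold_index else -math.log(1.0 - p)
    return total / len(scores)


def multiple_choice_loss(scores, gold_index):
    """Softmax cross-entropy over all candidates of one example.

    Hypothetical helper: candidates compete, so only the relative scores
    within the example matter.
    """
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[gold_index]


# Toy logits for two candidate referents; index 0 is the correct one.
scores = [2.0, 1.5]
shifted = [s + 10.0 for s in scores]

mc_same = math.isclose(
    multiple_choice_loss(scores, 0), multiple_choice_loss(shifted, 0)
)
pw_same = math.isclose(pointwise_loss(scores, 0), pointwise_loss(shifted, 0))
print("multiple choice unchanged by a constant shift:", mc_same)
print("pointwise unchanged by a constant shift:", pw_same)
```

In this sketch, adding a constant to every candidate's score leaves the multiple-choice loss unchanged but alters the pointwise loss, which is one simple way to see that the two formalizations pose different learning problems even when the underlying model and data are identical.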
