CL · Feb 20, 2024

EvoGrad: A Dynamic Take on the Winograd Schema Challenge with Human Adversaries

arXiv:2402.13372v284 citations · h-index: 3 · Has Code · LREC
AI Analysis

This paper addresses the limitation of LLMs in handling minor variations in common-sense reasoning tasks, though the contribution is incremental because it builds on existing WSC benchmarks.

The authors address LLMs' difficulty with altered Winograd Schema Challenge instances by introducing EvoGrad, a human-in-the-loop platform used to build a dynamic dataset of 3,691 instances. On this dataset, GPT-3.5 achieved 65.0% accuracy, compared with human performance of 92.8%.

While Large Language Models (LLMs) excel at the Winograd Schema Challenge (WSC), a coreference resolution task testing common-sense reasoning through pronoun disambiguation, they struggle with instances that feature minor alterations or rewording. To address this, we introduce EvoGrad, an open-source platform that harnesses a human-in-the-loop approach to create a dynamic dataset tailored to such altered WSC instances. Leveraging ChatGPT's capabilities, we expand our task instances from 182 to 3,691, setting a new benchmark for diverse common-sense reasoning datasets. Additionally, we introduce the error depth metric, assessing model stability in dynamic tasks. Our results emphasize the challenge posed by EvoGrad: Even the best performing LLM, GPT-3.5, achieves an accuracy of 65.0% with an average error depth of 7.2, a stark contrast to human performance of 92.8% accuracy without perturbation errors. This highlights ongoing model limitations and the value of dynamic datasets in uncovering them.

