CL
Dec 22, 2021

CRASS: A Novel Data Set and Benchmark to Test Counterfactual Reasoning of Large Language Models

arXiv:2112.11941v3
598 citations
Originality: Synthesis-oriented
AI Analysis

This paper provides a new benchmark for assessing counterfactual reasoning in AI. The contribution is incremental: it builds on existing evaluation methods but introduces a dedicated dataset.

The authors tackle the problem of evaluating counterfactual reasoning in large language models by introducing the CRASS dataset and benchmark. They test six state-of-the-art models on it, and the results reveal significant room for improvement compared to a human baseline.

We introduce the CRASS (counterfactual reasoning assessment) data set and benchmark utilizing questionized counterfactual conditionals as a novel and powerful tool to evaluate large language models. We present the data set design and benchmark that supports scoring against a crowd-validated human baseline. We test six state-of-the-art models against our benchmark. Our results show that it poses a valid challenge for these models and opens up considerable room for their improvement.
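
To make the idea of scoring a model on questionized counterfactual conditionals concrete, the following is a minimal sketch, not the authors' implementation. It assumes a multiple-choice item layout (a premise, a counterfactual question, answer options, and the index of the correct option); the `Item` class, the `accuracy` function, the `always_first` chooser, and the two sample items are all hypothetical and were invented for illustration, not taken from the CRASS data set.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Item:
    """One hypothetical benchmark item (layout is an assumption)."""
    premise: str             # factual base sentence
    question: str            # the questionized counterfactual conditional
    options: Sequence[str]   # candidate answers
    correct: int             # index of the correct option


def accuracy(items: Sequence[Item], choose: Callable[[Item], int]) -> float:
    """Fraction of items where the chooser picks the correct option."""
    if not items:
        raise ValueError("need at least one item")
    hits = sum(1 for item in items if choose(item) == item.correct)
    return hits / len(items)


# Invented sample items, for illustration only.
ITEMS = [
    Item(
        premise="A glass falls off the table and breaks.",
        question="What would have happened if the glass had landed on a pillow?",
        options=(
            "The glass would probably not have broken.",
            "The table would have broken.",
            "The pillow would have fallen off the table.",
        ),
        correct=0,
    ),
    Item(
        premise="A student studies and passes the exam.",
        question="What would have happened if the student had skipped the exam?",
        options=(
            "The exam would have studied.",
            "The student would not have passed the exam.",
            "The student would have passed twice.",
        ),
        correct=1,
    ),
]


def always_first(item: Item) -> int:
    """Trivial stand-in for a model: always pick the first option."""
    return 0


if __name__ == "__main__":
    print(accuracy(ITEMS, always_first))
```

In a real evaluation, `always_first` would be replaced by a function that queries a language model and maps its output to an option index; the same `accuracy` value computed for human annotators would then serve as the baseline to compare against.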

Code Implementations: 1 repo
