CL, May 13

CommonWhy: A Dataset for Evaluating Entity-Based Causal Commonsense Reasoning in Large Language Models

Armin Toroghi, Faeze Moradi Kalarde, Scott Sanner

arXiv:2605.12918

AI Analysis

This dataset provides a new benchmark for assessing LLMs' ability to perform causal commonsense reasoning about specific entities, a capability crucial for real-world interaction.

CommonWhy is a dataset of 15,000 why questions for evaluating entity-based causal commonsense reasoning in LLMs. Experiments reveal significant shortcomings in state-of-the-art models, including frequent factual hallucinations and failures in causal reasoning.

To interact effectively with the real world, Large Language Models (LLMs) require entity-based commonsense reasoning, a challenging task that calls for integrating factual knowledge about specific entities with commonsense inference. Existing datasets for evaluating entity-based commonsense reasoning in LLMs have largely focused on True/False or multiple-choice questions, leaving largely unexamined the models' ability to reason abductively about causes and effects and to generate explanations. In this work, we introduce CommonWhy, a dataset of 15,000 why questions designed to evaluate entity-based commonsense reasoning about causal relationships in LLMs. CommonWhy also serves as a Knowledge Graph Question Answering (KGQA) benchmark, as all supporting knowledge required to answer its queries is available in the Wikidata knowledge graph. Unlike existing KGQA datasets, which primarily test fact retrieval, CommonWhy targets causal commonsense reasoning, establishing a new paradigm for KGQA evaluation. Experiments with state-of-the-art LLMs and LLM-based KGQA methods reveal their significant shortcomings, including frequent factual hallucinations and failures in causal reasoning.
