CL · Feb 16, 2025

CounterBench: A Benchmark for Counterfactuals Reasoning in Large Language Models

arXiv:2502.11008v1 · 9 citations · h-index: 1 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the need to better assess and improve LLMs' counterfactual reasoning abilities, which is crucial for advancing causal reasoning in AI. It is an incremental advance that builds on existing causal reasoning benchmarks.

The paper evaluates large language models (LLMs) on counterfactual reasoning, a challenging aspect of causality. It introduces a new benchmark dataset, CounterBench, and shows that most LLMs perform near random guessing, while a novel reasoning paradigm, CoIn, significantly improves their performance.

Counterfactual reasoning is widely recognized as one of the most challenging and intricate aspects of causality in artificial intelligence. In this paper, we evaluate the performance of large language models (LLMs) in counterfactual reasoning. In contrast to previous studies that primarily focus on commonsense causal reasoning, where LLMs often rely on prior knowledge for inference, we specifically assess their ability to perform counterfactual inference using a set of formal rules. To support this evaluation, we introduce a new benchmark dataset, CounterBench, comprising 1K counterfactual reasoning questions. The dataset is designed with varying levels of difficulty, diverse causal graph structures, distinct types of counterfactual questions, and multiple nonsensical name variants. Our experiments demonstrate that counterfactual reasoning poses a significant challenge for LLMs, with most models performing at levels comparable to random guessing. To enhance LLMs' counterfactual reasoning ability, we propose a novel reasoning paradigm, CoIn, which guides LLMs through iterative reasoning and backtracking to systematically explore counterfactual solutions. Experimental results show that our method significantly improves LLM performance on counterfactual reasoning tasks and consistently enhances performance across different LLMs. Our dataset is available at https://huggingface.co/datasets/CounterBench/CounterBench.
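To make "counterfactual inference using a set of formal rules" concrete, here is a minimal sketch of the kind of question such a benchmark can pose: a small deterministic causal graph over nonsensical variable names, an observed world, and a query about what would have happened under an intervention. The variable names (`blaf`, `ziklo`, `trune`, `vopt`), the graph, and the helper `evaluate` are all hypothetical illustrations. They are not taken from CounterBench and do not represent the CoIn method. Because every mechanism here is deterministic and the root variable is observed, the abduction step of counterfactual inference reduces to reusing the observed root value.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical format: variable -> (parent names, boolean function of the parents).
# Entries must be listed in topological order (parents before children).
Mechanisms = Dict[str, Tuple[List[str], Callable[..., bool]]]


def evaluate(
    mechanisms: Mechanisms,
    exogenous: Dict[str, bool],
    interventions: Optional[Dict[str, bool]] = None,
) -> Dict[str, bool]:
    """Compute every variable, overriding intervened variables with fixed values."""
    do = interventions or {}
    # Start from the root values; an intervention on a root replaces its value.
    values = {**exogenous, **{k: v for k, v in do.items() if k not in mechanisms}}
    for name, (parents, fn) in mechanisms.items():
        if name in do:
            values[name] = do[name]  # action: cut the incoming edges of `name`
        else:
            values[name] = fn(*(values[p] for p in parents))  # prediction
    return values


# Illustrative graph: blaf -> ziklo, and both blaf and ziklo feed trune and vopt.
mechanisms: Mechanisms = {
    "ziklo": (["blaf"], lambda blaf: blaf),
    "trune": (["blaf", "ziklo"], lambda blaf, ziklo: blaf and ziklo),
    "vopt": (["blaf", "ziklo"], lambda blaf, ziklo: blaf or ziklo),
}

observed_roots = {"blaf": True}  # abduction is trivial: the root is observed

factual = evaluate(mechanisms, observed_roots)
counterfactual = evaluate(mechanisms, observed_roots, {"ziklo": False})

# "Would trune have occurred if ziklo had not occurred?"
print("trune:", factual["trune"], "->", counterfactual["trune"])
# "Would vopt have occurred if ziklo had not occurred?"
print("vopt:", factual["vopt"], "->", counterfactual["vopt"])
```

In this sketch the intervention flips `trune` but leaves `vopt` unchanged, because `vopt` is still caused directly by `blaf`. Answering correctly requires following the graph's rules rather than any prior knowledge about the (meaningless) names, which is the property the abstract emphasizes.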

