Benchmarking LLM Causal Reasoning with Scientifically Validated Relationships
This work addresses the need for reliable causal reasoning in high-stakes applications such as health and law, but its contribution is incremental: it benchmarks existing models rather than improving them.
The authors evaluate causal reasoning in Large Language Models (LLMs) with a novel benchmark built from scientifically validated relationships drawn from the economics and finance literature. The best model achieved only 57.6% accuracy, which highlights significant limitations.
Causal reasoning is fundamental for Large Language Models (LLMs) to understand genuine cause-and-effect relationships beyond pattern matching. Existing benchmarks suffer from critical limitations such as reliance on synthetic data and narrow domain coverage. We introduce a novel benchmark constructed from causally identified relationships extracted from top-tier economics and finance journals, drawing on rigorous methodologies including instrumental variables, difference-in-differences, and regression discontinuity designs. Our benchmark comprises 40,379 evaluation items covering five task types across domains such as health, environment, technology, law, and culture. Experimental results on eight state-of-the-art LLMs reveal substantial limitations, with the best model achieving only 57.6% accuracy. Moreover, model scale does not consistently translate to superior performance, and even advanced reasoning models struggle with fundamental causal relationship identification. These findings underscore a critical gap between current LLM capabilities and the demands of reliable causal reasoning in high-stakes applications.
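The headline accuracy figure is an aggregate over evaluation items spanning several task types. As a rough illustration of how such a score might be computed, here is a minimal sketch. The item schema (`task`, `gold`, `prediction`), the placeholder task labels, and the exact-match scoring rule are all assumptions made for this example; they are not the benchmark's actual format or the authors' evaluation code.

```python
from collections import defaultdict


def accuracy_by_task(items):
    """Return (overall_accuracy, {task_type: accuracy}) for scored items.

    Each item is a dict with hypothetical keys "task", "gold", and
    "prediction"; a prediction counts as correct on an exact match.
    """
    if not items:
        raise ValueError("no evaluation items to score")

    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item["task"]] += 1
        if item["prediction"] == item["gold"]:
            correct[item["task"]] += 1

    per_task = {task: correct[task] / total[task] for task in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_task


# Toy data with placeholder task labels (not the benchmark's task names).
items = [
    {"task": "task_a", "gold": "yes", "prediction": "yes"},
    {"task": "task_a", "gold": "no", "prediction": "yes"},
    {"task": "task_b", "gold": "increase", "prediction": "increase"},
    {"task": "task_b", "gold": "decrease", "prediction": "decrease"},
]
overall, per_task = accuracy_by_task(items)
print(overall, per_task)
```

Reporting per-task accuracy alongside the overall figure helps show whether a low aggregate score comes from one difficult task type or from uniformly weak performance.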