AI · LG · Feb 14, 2025

Do Large Language Models Reason Causally Like Us? Even Better?

arXiv:2502.10215v2 · 7 citations · h-index: 30 · CogSci
Originality: Incremental advance
AI Analysis

This paper addresses whether LLMs reason causally like humans, a question with implications for AI evaluation and cognitive science, though its contribution is incremental because it compares existing models.

The study compared causal reasoning in humans and four large language models (LLMs) using collider graph tasks. GPT-4o, Gemini-Pro, and Claude often gave more normatively aligned judgments than humans, while GPT-3.5 was often nonsensical. Even the top models did not fully capture subtler patterns such as 'explaining away'.

Causal reasoning is a core component of intelligence. Large language models (LLMs) have shown impressive capabilities in generating human-like text, raising questions about whether their responses reflect true understanding or statistical patterns. We compared causal reasoning in humans and four LLMs using tasks based on collider graphs, rating the likelihood of a query variable occurring given evidence from other variables. LLMs' causal inferences ranged from often nonsensical (GPT-3.5) to human-like to often more normatively aligned than those of humans (GPT-4o, Gemini-Pro, and Claude). Computational model fitting showed that one reason for GPT-4o, Gemini-Pro, and Claude's superior performance is they didn't exhibit the "associative bias" that plagues human causal reasoning. Nevertheless, even these LLMs did not fully capture subtler reasoning patterns associated with collider graphs, such as "explaining away".
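
To make the normative benchmark concrete, here is a minimal sketch of exact inference in a collider graph C1 -> E <- C2. It is an illustration only, not the paper's computational model: the noisy-OR parameterization, the function names `joint` and `query`, and the values for `prior`, `strength`, and `leak` are all assumptions chosen for the example.

```python
from itertools import product


def joint(c1, c2, e, prior=0.5, strength=0.8, leak=0.1):
    """P(C1=c1, C2=c2, E=e) for the collider C1 -> E <- C2.

    Assumed (hypothetical) parameterization: independent causes with the
    same prior, and a noisy-OR effect with a background leak.
    """
    p_causes = (prior if c1 else 1 - prior) * (prior if c2 else 1 - prior)
    # Noisy-OR: E fails only if the leak and every present cause all fail.
    p_effect = 1 - (1 - leak) * (1 - strength) ** (c1 + c2)
    return p_causes * (p_effect if e else 1 - p_effect)


def query(target, evidence, **params):
    """P(target = 1 | evidence), by enumerating all 8 joint states."""
    num = den = 0.0
    for c1, c2, e in product((0, 1), repeat=3):
        state = {"C1": c1, "C2": c2, "E": e}
        if any(state[name] != value for name, value in evidence.items()):
            continue  # state contradicts the evidence
        p = joint(c1, c2, e, **params)
        den += p
        num += p * state[target]
    return num / den


if __name__ == "__main__":
    # Explaining away: given the effect, learning that the other cause is
    # present should lower belief in C1; learning it is absent should raise it.
    print("P(C1 | E=1)       =", round(query("C1", {"E": 1}), 3))
    print("P(C1 | E=1, C2=1) =", round(query("C1", {"E": 1, "C2": 1}), 3))
    print("P(C1 | E=1, C2=0) =", round(query("C1", {"E": 1, "C2": 0}), 3))
    # Independence of causes: with E unobserved, C2 says nothing about C1.
    print("P(C1 | C2=1)      =", round(query("C1", {"C2": 1}), 3))
```

Under these assumptions, the normative answers satisfy P(C1 | E=1, C2=1) < P(C1 | E=1) < P(C1 | E=1, C2=0), which is the "explaining away" pattern. P(C1 | C2=1) stays at the prior, so a reasoner with an "associative bias" would rate it higher than this model does.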

