CL, AI | Feb 21, 2025

ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models

arXiv:2502.15487v3 | 10 citations | h-index: 6 | Has Code | ACL
Originality: Incremental advance
AI Analysis

This work addresses the need for better evaluation of LLMs on causal reasoning tasks, which matters for applications requiring interpretive accuracy. The contribution is incremental: it centers on dataset creation and benchmarking.

The authors evaluate explicit causal reasoning in Large Language Models by introducing ExpliCa, a new dataset that integrates causal and temporal relations with human acceptability ratings. They find that even top models struggle to reach 0.80 accuracy and often confound temporal relations with causal ones.

Large Language Models (LLMs) are increasingly used in tasks requiring interpretive and inferential accuracy. In this paper, we introduce ExpliCa, a new dataset for evaluating LLMs in explicit causal reasoning. ExpliCa uniquely integrates both causal and temporal relations presented in different linguistic orders and explicitly expressed by linguistic connectives. The dataset is enriched with crowdsourced human acceptability ratings. We tested LLMs on ExpliCa through prompting and perplexity-based metrics. We assessed seven commercial and open-source LLMs, revealing that even top models struggle to reach 0.80 accuracy. Interestingly, models tend to confound temporal relations with causal ones, and their performance is also strongly influenced by the linguistic order of the events. Finally, perplexity-based scores and prompting performance are differently affected by model size.
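As a rough illustration of what a perplexity-based evaluation can look like, the sketch below scores several candidate sentences that differ only in their connective and picks the one with the lowest perplexity. It is a minimal sketch under stated assumptions, not the authors' procedure: the function names, the connective labels, and the log-probability values are all hypothetical, and in practice the per-token log-probabilities would come from a language model.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities:
    exp of the negative mean log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_connective(candidates):
    """Return the connective whose sentence has the lowest perplexity.

    `candidates` maps a connective (hypothetical labels) to the
    per-token log-probabilities of the sentence built with it.
    """
    return min(candidates, key=lambda c: perplexity(candidates[c]))


# Made-up log-probabilities for illustration only; not taken from the paper.
toy_scores = {
    "so": [-1.2, -0.8, -0.5, -1.0],
    "because": [-1.5, -1.9, -1.1, -1.6],
    "then": [-1.3, -0.9, -0.9, -1.1],
    "after": [-2.0, -1.7, -1.4, -1.8],
}

best = pick_connective(toy_scores)
print(best, round(perplexity(toy_scores[best]), 3))
```

Accuracy under such a scheme would be the fraction of items where the selected connective matches the human-preferred one. Because perplexity uses the mean log-probability, it normalizes for sentence length, which matters when candidate sentences tokenize to different lengths.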

