CL AI Jan 21

Benchmarking LLMs for Pairwise Causal Discovery in Biomedical and Multi-Domain Contexts

Sydney Anuyah, Sneha Shajee-Mohan, Ankit-Singh Chauhan, Sunandan Chakraborty

arXiv:2601.15479v11.11 citations Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for safe LLM deployment in high-stakes fields like biomedicine by benchmarking the causal reasoning abilities of LLMs. It is incremental in that it focuses on evaluation rather than proposing new methods.

The paper evaluates large language models (LLMs) on pairwise causal discovery from text in biomedical and multi-domain contexts. It finds major deficiencies: the best models achieve mean scores of only 49.57% on causal detection and 47.12% on causal extraction.

The safe deployment of large language models (LLMs) in high-stakes fields like biomedicine requires them to be able to reason about cause and effect. We investigate this ability by testing 13 open-source LLMs on a fundamental task: pairwise causal discovery (PCD) from text. Our benchmark, using 12 diverse datasets, evaluates two core skills: 1) Causal Detection (identifying whether a text contains a causal link) and 2) Causal Extraction (pulling out the exact cause and effect phrases). We tested various prompting methods, from simple instructions (zero-shot) to more complex strategies like Chain-of-Thought (CoT) and Few-shot In-Context Learning (FICL). The results show major deficiencies in current models. The best model for detection, DeepSeek-R1-Distill-Llama-70B, achieved a mean score of only 49.57% ($C_{detect}$), while the best for extraction, Qwen2.5-Coder-32B-Instruct, reached just 47.12% ($C_{extract}$). Models performed best on simple, explicit, single-sentence relations. However, performance plummeted for more difficult (and realistic) cases, such as implicit relationships, links spanning multiple sentences, and texts containing multiple causal pairs. We provide a unified evaluation framework, built on a dataset validated with high inter-annotator agreement ($\kappa \ge 0.758$), and make all our data, code, and prompts publicly available to spur further research. Code available here: https://github.com/sydneyanuyah/CausalDiscovery
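To make the two skills concrete, here is a minimal Python sketch of how detection and extraction could be scored on labeled examples. It is an illustration under stated assumptions, not the paper's evaluation code: the `Example` record, the function names, and the choice of plain accuracy and normalized exact match are all hypothetical, and the abstract does not define how $C_{detect}$ and $C_{extract}$ are actually computed.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Pair = Tuple[str, str]  # (cause phrase, effect phrase)


@dataclass
class Example:
    """One labeled text (hypothetical schema, not the benchmark's format)."""
    text: str
    is_causal: bool          # gold label for causal detection
    pair: Optional[Pair]     # gold (cause, effect); None when non-causal


def normalize(phrase: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(phrase.lower().split())


def detection_accuracy(gold: List[Example], predicted: List[bool]) -> float:
    """Fraction of texts whose causal / non-causal label was predicted correctly.

    Assumption: plain accuracy stands in for the paper's detection score.
    """
    if not gold:
        return 0.0
    correct = sum(g.is_causal == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def extraction_exact_match(
    gold: List[Example], predicted: List[Optional[Pair]]
) -> float:
    """Fraction of causal texts whose (cause, effect) pair matches exactly.

    Assumption: normalized exact match stands in for the paper's extraction
    score. Direction matters, so a swapped cause and effect counts as a miss.
    """
    causal = [(g, p) for g, p in zip(gold, predicted) if g.is_causal and g.pair]
    if not causal:
        return 0.0
    hits = 0
    for g, p in causal:
        if p is None:
            continue
        if (normalize(p[0]), normalize(p[1])) == (
            normalize(g.pair[0]),
            normalize(g.pair[1]),
        ):
            hits += 1
    return hits / len(causal)
```

Scoring extraction only on gold-causal texts keeps the two skills separate: a model that over-predicts causal links is penalized in detection, while extraction measures only whether the right phrases come out in the right direction.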
