Uncovering Hidden Correctness in LLM Causal Reasoning via Symbolic Verification
This work addresses the need for more rigorous evaluation of LLMs on causal reasoning tasks, though the contribution is incremental because it builds on existing symbolic verification methods.
The authors tackle the problem of evaluating causal reasoning in large language models (LLMs) by proposing DoVerifier, a symbolic verifier that checks formal validity using do-calculus and probability theory. On synthetic data and causal QA benchmarks, it assesses semantic correctness more accurately.
Large language models (LLMs) are increasingly applied to tasks that involve causal reasoning. However, current benchmarks often rely on string matching or surface-level metrics that do not capture whether a model's output is formally valid under the semantics of causal reasoning. To address this, we propose DoVerifier, a simple symbolic verifier that checks whether LLM-generated causal expressions are derivable from a given causal graph using rules from do-calculus and probability theory. This allows us to recover correct answers to causal queries that would otherwise be marked incorrect because of superficial differences in how they are written rather than in their causal semantics. Our evaluations on synthetic data and causal QA benchmarks show that DoVerifier more accurately captures the semantic correctness of causal reasoning traces, offering a more rigorous and informative way to evaluate LLMs on causal reasoning.
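To make the idea of checking derivability against a causal graph concrete, the following is a minimal sketch of a single ingredient such a verifier could use. It is not DoVerifier's actual implementation, which the text above does not specify. It tests whether Rule 2 of do-calculus (exchanging an action for an observation) licenses rewriting P(y | do(x), z) as P(y | x, z), using a d-separation test on the graph with the edges leaving X removed. The function names `ancestors`, `d_separated`, and `rule2_applies`, the edge-list graph encoding, and the example graph are all assumptions made for illustration.

```python
from itertools import combinations


def ancestors(edges, nodes):
    """Return `nodes` together with all of their ancestors.

    `edges` is a list of (parent, child) pairs describing a DAG.
    """
    result = set(nodes)
    frontier = list(nodes)
    while frontier:
        node = frontier.pop()
        for parent, child in edges:
            if child == node and parent not in result:
                result.add(parent)
                frontier.append(parent)
    return result


def d_separated(edges, xs, ys, zs):
    """Test whether xs and ys are d-separated given zs.

    Uses the standard ancestral moral graph criterion: restrict to the
    ancestors of xs, ys, and zs, marry co-parents, drop edge directions,
    delete zs, and check that no path connects xs to ys.
    """
    xs, ys, zs = set(xs), set(ys), set(zs)
    keep = ancestors(edges, xs | ys | zs)
    sub = [(p, c) for p, c in edges if p in keep and c in keep]

    # Build the undirected moral graph over the ancestral set.
    neighbours = {n: set() for n in keep}
    for p, c in sub:
        neighbours[p].add(c)
        neighbours[c].add(p)
    for node in keep:
        parents = [p for p, c in sub if c == node]
        for a, b in combinations(parents, 2):
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Search from xs while never stepping onto a conditioned node.
    seen = set(xs - zs)
    frontier = list(seen)
    while frontier:
        node = frontier.pop()
        for nxt in neighbours[node]:
            if nxt not in zs and nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return not (seen & ys)


def rule2_applies(edges, y, x, z):
    """Special case of do-calculus Rule 2 with no other interventions.

    P(y | do(x), z) = P(y | x, z) holds if Y and X are d-separated given Z
    in the graph obtained by deleting every edge that leaves X.
    """
    x = set(x)
    pruned = [(p, c) for p, c in edges if p not in x]
    return d_separated(pruned, x, y, z)


# Hypothetical confounded graph: Z -> X, Z -> Y, X -> Y.
edges = [("Z", "X"), ("Z", "Y"), ("X", "Y")]
adjusted = rule2_applies(edges, {"Y"}, {"X"}, {"Z"})
unadjusted = rule2_applies(edges, {"Y"}, {"X"}, set())
```

In this example graph, conditioning on the confounder Z blocks the backdoor path, so the rule applies and `adjusted` is true; without Z the path X <- Z -> Y stays open and `unadjusted` is false. A verifier in this spirit would therefore accept P(y | x, z) as equivalent to P(y | do(x), z) even though the two strings differ, which is the kind of answer that string matching would reject.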