AI | Jan 13

T3: Benchmarking Sycophancy and Skepticism in Causal Judgment

arXiv:2601.08258v1 | 3 citations | h-index: 2

Originality: Incremental advance

AI Analysis

The paper provides a rigorous evaluation tool for LLM causal reasoning. It addresses a specific domain problem with incremental methodological improvements.

The authors introduce T3, a diagnostic benchmark of 454 expert-curated vignettes for evaluating LLM causal judgment. Applied to frontier models, it shows that safety-tuned models such as Claude Haiku reject 60% of valid links, and that GPT-5.2 underperforms GPT-4-Turbo by 55 points on ambiguous counterfactuals.

We introduce T3 (Testing Trustworthy Thinking), a diagnostic benchmark designed to rigorously evaluate LLM causal judgment across Pearl's Ladder of Causality. Comprising 454 expert-curated vignettes, T3 prioritizes high-resolution failure analysis, decomposing performance into Utility (sensitivity), Safety (specificity), and Wise Refusal on underdetermined cases. By applying T3 to frontier models, we diagnose two distinct pathologies: a "Skepticism Trap" at L1 (where safety-tuned models like Claude Haiku reject 60% of valid links) and a non-monotonic Scaling Paradox at L3. In the latter, the larger GPT-5.2 underperforms GPT-4-Turbo by 55 points on ambiguous counterfactuals, driven by a collapse into paralysis (excessive hedging) rather than hallucination. Finally, we use the benchmark to validate a process-verified protocol (RCA), showing that T3 successfully captures the restoration of decisive causal judgment under structured verification.
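The three-way decomposition in the abstract can be illustrated with a minimal sketch. It is not the authors' scoring code: the label names ("valid", "invalid", "underdetermined"), the answer names ("accept", "reject", "refuse"), and the `Item` and `score` helpers are all hypothetical, and the sketch assumes that any answer other than the expected one counts as a miss. Under those assumptions, Utility is the share of valid links accepted (sensitivity), Safety is the share of invalid links rejected (specificity), and Wise Refusal is the share of underdetermined cases on which the model declines to commit.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Item:
    gold: str    # hypothetical labels: "valid", "invalid", "underdetermined"
    answer: str  # hypothetical model outputs: "accept", "reject", "refuse"


def _rate(hits: int, total: int) -> Optional[float]:
    # Return None when a category is empty, so it is not mistaken for a 0% score.
    return hits / total if total else None


def score(items: Iterable[Item]) -> Dict[str, Optional[float]]:
    items = list(items)
    valid = [i for i in items if i.gold == "valid"]
    invalid = [i for i in items if i.gold == "invalid"]
    under = [i for i in items if i.gold == "underdetermined"]
    return {
        # Sensitivity: valid causal links the model accepts.
        "utility": _rate(sum(i.answer == "accept" for i in valid), len(valid)),
        # Specificity: invalid causal links the model rejects.
        "safety": _rate(sum(i.answer == "reject" for i in invalid), len(invalid)),
        # Underdetermined cases where the model declines to commit.
        "wise_refusal": _rate(sum(i.answer == "refuse" for i in under), len(under)),
    }


# Made-up toy data, not drawn from the benchmark.
demo = (
    [Item("valid", "accept")] * 2
    + [Item("valid", "reject")] * 3
    + [Item("invalid", "reject")] * 3
    + [Item("invalid", "accept")]
    + [Item("underdetermined", "refuse"), Item("underdetermined", "accept")]
)
result = score(demo)
print(result)
```

In this toy data the model rejects three of five valid links, so Utility is 0.4 even though Safety is 0.75. That is the shape of the "Skepticism Trap" the abstract describes, where over-rejection lowers sensitivity without showing up as a specificity failure.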
