AI · Jan 13

T3: Benchmarking Sycophancy and Skepticism in Causal Judgment

arXiv:2601.08258v1 · 3 citations · h-index: 2
Originality: Incremental advance
AI Analysis

T3 provides a rigorous evaluation tool for LLM causal reasoning. It addresses a specific domain problem and offers incremental methodological improvements.

The researchers introduce T3, a diagnostic benchmark of 454 expert-curated vignettes for evaluating LLM causal judgment. Applied to frontier models, it shows that safety-tuned models such as Claude Haiku reject 60% of valid causal links, and that GPT-5.2 underperforms GPT-4-Turbo by 55 points on ambiguous counterfactuals.

We introduce T3 (Testing Trustworthy Thinking), a diagnostic benchmark designed to rigorously evaluate LLM causal judgment across Pearl's Ladder of Causality. Comprising 454 expert-curated vignettes, T3 prioritizes high-resolution failure analysis, decomposing performance into Utility (sensitivity), Safety (specificity), and Wise Refusal on underdetermined cases. By applying T3 to frontier models, we diagnose two distinct pathologies: a "Skepticism Trap" at L1 (where safety-tuned models like Claude Haiku reject 60% of valid links) and a non-monotonic Scaling Paradox at L3. In the latter, the larger GPT-5.2 underperforms GPT-4-Turbo by 55 points on ambiguous counterfactuals, driven by a collapse into paralysis (excessive hedging) rather than hallucination. Finally, we use the benchmark to validate a process-verified protocol (RCA), showing that T3 successfully captures the restoration of decisive causal judgment under structured verification.
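
The three-way decomposition described in the abstract can be illustrated with a minimal sketch. It assumes each vignette has a gold label (valid, invalid, or underdetermined) and each model response is reduced to accept, reject, or refuse. The `Item` class, the label strings, and the `score` function are hypothetical names chosen for illustration, not the benchmark's actual schema or scoring code, and the toy data below is invented rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    # Hypothetical schema: not the actual T3 data format.
    gold: str    # "valid", "invalid", or "underdetermined"
    answer: str  # "accept", "reject", or "refuse"


def score(items: list[Item]) -> dict[str, Optional[float]]:
    """Split performance into the three rates named in the abstract."""

    def rate(gold: str, wanted: str) -> Optional[float]:
        # Fraction of items with this gold label that got the wanted answer.
        pool = [it for it in items if it.gold == gold]
        if not pool:
            return None  # undefined when no item carries this label
        return sum(it.answer == wanted for it in pool) / len(pool)

    return {
        "utility": rate("valid", "accept"),                # sensitivity
        "safety": rate("invalid", "reject"),               # specificity
        "wise_refusal": rate("underdetermined", "refuse"),
    }


# Toy data (invented): an over-skeptical model rejects most valid links.
toy = (
    [Item("valid", "accept")] * 2
    + [Item("valid", "reject")] * 3
    + [Item("invalid", "reject")] * 4
    + [Item("underdetermined", "refuse")] * 1
    + [Item("underdetermined", "accept")] * 1
)
result = score(toy)
print(result)
```

Reporting the three rates separately, rather than a single accuracy figure, is what makes the two failure modes distinguishable: in this toy set the model has perfect Safety but low Utility, the pattern the abstract calls the "Skepticism Trap".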

