ChaosBench-Logic v2: Evaluating LLM Logical Reasoning over Dynamical Systems at Scale
For researchers evaluating LLM reasoning, this benchmark exposes critical failure modes, such as prior collapse and inconsistency under paraphrase, that standard accuracy metrics miss.
ChaosBench-Logic v2 is a 40,886-question benchmark over 165 dynamical systems for evaluating LLM logical reasoning. It reveals that regime-transition reasoning remains near random (Matthews correlation coefficient, MCC = 0.05) even for frontier models, while first-order logic (FOL) deduction with given premises reaches MCC = 0.52.
Standard accuracy on binary reasoning benchmarks hides critical failure modes: prior collapse, inconsistency under paraphrase, and inability to reason about parameter-dependent dynamics. We present ChaosBench-Logic v2, a 40,886-question benchmark over 165 dynamical systems with 27 FOL predicates and 78 axiom edges, together with CARE (Calibration- and Adversarial-Robust Evaluation), a protocol that surfaces these pathologies. Evaluating 14 models, we find that regime-transition reasoning remains near random (MCC = 0.05) even for frontier models, whereas FOL deduction with given premises reaches MCC = 0.52. Per-family decomposition shows that the proprietary-model advantage concentrates on cross-indicator (+0.40) and consistency tasks, while open-source Qwen 2.5-32B dominates indicator diagnostics (0.91 vs. 0.45). Two models exhibit negative MCC on bifurcation questions, confirmed as systematic anti-correlation via confusion-matrix analysis.
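To make concrete why accuracy can hide prior collapse while MCC does not, the following is a minimal sketch using the standard definition of the Matthews correlation coefficient on a binary confusion matrix. The label distribution (20 true, 80 false) and the two toy "models" are hypothetical and are not taken from the benchmark; the function names are illustrative only, and returning 0 when the denominator vanishes is a common convention assumed here.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for two equal-length sequences of booleans."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Fraction of answers that match the label."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention: a constant predictor carries no signal
    return (tp * tn - fp * fn) / denom

# Hypothetical imbalanced question set: 20 TRUE labels, 80 FALSE labels.
labels = [True] * 20 + [False] * 80

# Toy "prior-collapsed" model: always answers FALSE.
collapsed = [False] * len(labels)

# Toy "anti-correlated" model: always answers the opposite of the label.
inverted = [not y for y in labels]

for name, preds in [("collapsed", collapsed), ("inverted", inverted)]:
    counts = confusion_counts(labels, preds)
    print(name, "accuracy =", accuracy(*counts), "MCC =", mcc(*counts))
```

In this toy setting the collapsed model scores 80% accuracy yet has an MCC of 0, and the inverted model has a negative MCC, which is the kind of systematic anti-correlation that a confusion-matrix analysis can confirm.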