ABD: Default-Exception Abduction in Finite First-Order Worlds
This work addresses the evaluation of logical reasoning in large language models and is aimed at researchers in AI and formal methods. It presents a new benchmark that incrementally improves on existing testing methodology.
The authors introduce ABD, a benchmark for default-exception abduction in finite first-order worlds. Models must output a first-order formula that defines exceptions, restoring satisfiability while keeping the exceptions sparse. The authors evaluate ten frontier LLMs on 600 instances and find that the best models achieve high validity but still show parsimony gaps, and that holdout evaluation reveals distinct generalization failure modes across three observation regimes.
We introduce ABD, a benchmark for default-exception abduction over finite first-order worlds. Given a background theory with an abnormality predicate and a set of relational structures, a model must output a first-order formula that defines exceptions, restoring satisfiability while keeping exceptions sparse. We formalize three observation regimes (closed-world, existential completion, universal completion) with exact SMT verification. We evaluate ten frontier LLMs on 600 instances: the best models achieve high validity, but parsimony gaps remain, and holdout evaluation reveals distinct generalization failure modes across regimes.
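To make the task concrete, the following is a minimal sketch of the closed-world case only. It is not the benchmark's instance format or verifier: the toy world, the default rule "birds fly unless abnormal", the predicate names, and the helper functions are all hypothetical, and plain enumeration over a four-element domain stands in for the exact SMT verification the abstract mentions. It shows how a candidate definition of the abnormality predicate can be checked for validity (the default holds once exceptions are excused) and for sparsity (how many elements it marks as exceptions).

```python
# Hypothetical toy world. The names and the default rule are illustrative
# assumptions, not taken from the ABD benchmark.
domain = ["a", "b", "c", "d"]
bird = {"a", "b", "c"}
penguin = {"c"}
flies = {"a", "b"}

def default_holds(x, ab):
    # Default with an abnormality predicate:
    #   bird(x) and not Ab(x)  ->  flies(x)
    return not (x in bird and x not in ab) or x in flies

def evaluate(candidate):
    # Extension of Ab that the candidate formula defines in this world.
    ab = {x for x in domain if candidate(x)}
    # Valid if the default holds for every element once Ab is fixed.
    valid = all(default_holds(x, ab) for x in domain)
    return valid, len(ab)

def minimal_exceptions():
    # For this single default, the smallest Ab that restores satisfiability
    # is exactly the set of elements that violate the default.
    return {x for x in domain if x in bird and x not in flies}

# Candidate definitions of Ab(x), written as Python predicates.
candidates = {
    "penguin(x)": lambda x: x in penguin,  # valid and sparse
    "bird(x)": lambda x: x in bird,        # valid but over-general
    "false": lambda x: False,              # sparse but invalid
}

results = {name: evaluate(f) for name, f in candidates.items()}
optimum = len(minimal_exceptions())

for name, (valid, size) in results.items():
    gap = size - optimum if valid else None
    print(f"Ab(x) := {name}: valid={valid}, exceptions={size}, gap={gap}")
```

In this sketch, the difference between a valid candidate's exception count and the minimum possible count plays the role of a parsimony gap. The existential and universal completion regimes would additionally require reasoning over unknown facts, which this enumeration does not attempt.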