AILS-NTUA at SemEval-2026 Task 12: Graph-Based Retrieval and Reflective Prompting for Abductive Event Reasoning
This work provides a top-performing system and identifies systematic failure modes in multi-label causal reasoning, which is significant for researchers working on abductive reasoning and LLM robustness.
The authors developed a three-stage system for Abductive Event Reasoning that placed first on the SemEval-2026 Task 12 evaluation-phase leaderboard with an accuracy score of 0.95. Their error analysis of 14 models from 7 families revealed three shared inductive biases in multi-label causal reasoning. These biases converge across model families (a 51% cause-count reduction), which points to systematic rather than model-specific failure modes.
We present a winning three-stage system for SemEval-2026 Task 12: Abductive Event Reasoning. The system combines graph-based retrieval, LLM-driven abductive reasoning with prompt design optimized through reflective prompt evolution, and post-hoc consistency enforcement. It ranks first on the evaluation-phase leaderboard with an accuracy score of 0.95. Cross-model error analysis across 14 models (7 families) reveals three shared inductive biases: causal chain incompleteness, proximate cause preference, and salience bias. Their cross-family convergence (51% cause-count reduction) indicates systematic rather than model-specific failure modes in multi-label causal reasoning.
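The three stages named in the abstract can be pictured with a minimal sketch. The abstract does not specify the graph construction, the prompts, or the consistency rules, so everything below is an assumption made for illustration: the function names (`graph_retrieve`, `toy_reason`, `enforce_consistency`, `run_pipeline`), the toy graph, the string-matching stand-in for the LLM call, and the rule that identical option texts must receive identical labels are all hypothetical and are not the authors' implementation.

```python
from typing import Dict, List, Set


def graph_retrieve(seed: str, edges: Dict[str, List[str]], hops: int = 1) -> List[str]:
    """Stage 1 (hypothetical): breadth-first expansion from a seed node,
    returning every node reachable within `hops` links."""
    seen = {seed}
    frontier = [seed]
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            for neighbour in edges.get(node, []):
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return sorted(seen - {seed})


def toy_reason(context: List[str], options: List[str]) -> Set[int]:
    """Stage 2 stand-in: a real system would prompt an LLM here. This toy
    version selects every option that appears in the retrieved context."""
    return {i for i, option in enumerate(options) if option in context}


def enforce_consistency(selected: Set[int], options: List[str]) -> Set[int]:
    """Stage 3 (hypothetical rule): options with identical text must get
    identical labels, so selecting one copy selects all copies."""
    chosen_texts = {options[i] for i in selected}
    return {i for i, option in enumerate(options) if option in chosen_texts}


def run_pipeline(event: str, edges: Dict[str, List[str]],
                 options: List[str], hops: int = 1) -> Set[int]:
    """Chain the three stages and return the indices of selected causes."""
    context = graph_retrieve(event, edges, hops)
    return enforce_consistency(toy_reason(context, options), options)


# Toy data, invented for this example only.
edges = {
    "flood": ["heavy rain", "dam failure"],
    "heavy rain": ["storm front"],
}
options = ["heavy rain", "earthquake", "storm front", "heavy rain"]

one_hop = run_pipeline("flood", edges, options, hops=1)
two_hop = run_pipeline("flood", edges, options, hops=2)
```

In this toy setup, one hop of retrieval surfaces only the direct neighbour `heavy rain`, while two hops also reach `storm front`, so the multi-label answer grows with retrieval depth. The final stage only normalizes the selected set and does not change which causes the reasoning step found.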