LOGICAL-COMMONSENSEQA: A Benchmark for Logical Commonsense Reasoning
This benchmark addresses the need to evaluate compositional commonsense reasoning in AI and exposes fundamental limitations in current models, although the contribution is incremental in that it builds on existing commonsense benchmarks.
The authors introduce LOGICAL-COMMONSENSEQA, a benchmark that evaluates logical composition over pairs of statements using the operators AND, OR, and NEITHER/NOR. They find that models perform reasonably on conjunctive reasoning and moderately on disjunctive reasoning, but degrade sharply on negation-based questions.
Commonsense reasoning often involves evaluating multiple plausible interpretations rather than selecting a single atomic answer, yet most benchmarks rely on single-label evaluation, obscuring whether statements are jointly plausible, mutually exclusive, or jointly implausible. We introduce LOGICAL-COMMONSENSEQA, a benchmark that reframes commonsense reasoning as logical composition over pairs of atomic statements using plausibility-level operators (AND, OR, NEITHER/NOR). Evaluating instruction-tuned, reasoning-specialized, and fine-tuned models under zero-shot, few-shot, and chain-of-thought prompting, we find that while models perform reasonably on conjunctive and moderately on disjunctive reasoning, performance degrades sharply on negation-based questions. LOGICAL-COMMONSENSEQA exposes fundamental reasoning limitations and provides a controlled framework for advancing compositional commonsense reasoning.
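To make the idea of plausibility-level composition concrete, the following minimal sketch shows one way the three operators could be read over a pair of atomic statements, together with a per-operator accuracy breakdown of the kind the evaluation reports. It is an illustration under stated assumptions, not the benchmark's actual implementation: the names `Operator`, `compose`, and `per_operator_accuracy` are hypothetical, each atomic statement is assumed to carry a binary plausibility label, and OR is assumed to be inclusive (at least one statement plausible). The benchmark's own definitions may differ.

```python
from collections import defaultdict
from enum import Enum


class Operator(Enum):
    """Plausibility-level operators named in the abstract."""
    AND = "AND"
    OR = "OR"
    NEITHER_NOR = "NEITHER/NOR"


def compose(op: Operator, a_plausible: bool, b_plausible: bool) -> bool:
    """Return whether the composite option holds for two atomic statements.

    Assumed semantics (not confirmed by the source):
      AND         -> both statements are plausible
      OR          -> at least one statement is plausible (inclusive)
      NEITHER/NOR -> both statements are implausible
    """
    if op is Operator.AND:
        return a_plausible and b_plausible
    if op is Operator.OR:
        return a_plausible or b_plausible
    if op is Operator.NEITHER_NOR:
        return not (a_plausible or b_plausible)
    raise ValueError(f"unknown operator: {op!r}")


def per_operator_accuracy(records):
    """Compute accuracy separately for each operator.

    `records` is an iterable of (Operator, is_correct) pairs, where
    `is_correct` says whether the model chose the right option.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for op, is_correct in records:
        totals[op] += 1
        hits[op] += int(is_correct)
    return {op: hits[op] / totals[op] for op in totals}
```

Reporting accuracy per operator rather than as a single aggregate is what allows a gap between conjunctive, disjunctive, and negation-based questions to show up; a pooled score would hide it.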