CL AI · Jun 8, 2025

BIS Reasoning 1.0: The First Large-Scale Japanese Benchmark for Belief-Inconsistent Syllogistic Reasoning

arXiv:2506.06955v4 · 1 citation · h-index: 7
Originality: Incremental advance
AI Analysis

This work addresses the need for reliable LLMs in high-stakes domains such as law and healthcare, where truth must override belief. It is an incremental advance because it builds on prior reasoning benchmarks.

The authors evaluate belief-inconsistent reasoning in large language models (LLMs) by creating BIS Reasoning 1.0, a large-scale Japanese dataset of syllogistic reasoning problems. GPT-4o achieved 79.54% accuracy, and the results reveal significant weaknesses in LLMs when handling logically valid but belief-conflicting inputs.

We present BIS Reasoning 1.0, the first large-scale Japanese dataset of syllogistic reasoning problems explicitly designed to evaluate belief-inconsistent reasoning in large language models (LLMs). Unlike prior datasets such as NeuBAROCO and JFLD, which focus on general or belief-aligned reasoning, BIS Reasoning 1.0 introduces logically valid yet belief-inconsistent syllogisms to uncover reasoning biases in LLMs trained on human-aligned corpora. We benchmark state-of-the-art models, including GPT models, Claude models, and leading Japanese LLMs, revealing significant variance in performance, with GPT-4o achieving 79.54% accuracy. Our analysis identifies critical weaknesses in current LLMs when handling logically valid but belief-conflicting inputs. These findings have important implications for deploying LLMs in high-stakes domains such as law, healthcare, and scientific literature, where truth must override intuitive belief to ensure integrity and safety.
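To make the idea of a logically valid but belief-inconsistent syllogism concrete, here is a minimal Python sketch of how accuracy on such items could be scored. Everything in it is an assumption for illustration: the `SyllogismItem` layout, the two English toy items, and both stand-in judges are hypothetical and are not taken from the BIS Reasoning 1.0 dataset or the authors' evaluation code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SyllogismItem:
    """Hypothetical item layout; not the dataset's actual schema."""
    premise_1: str
    premise_2: str
    conclusion: str
    is_valid: bool  # gold label: does the conclusion follow from the premises?


def accuracy(items: List[SyllogismItem],
             judge: Callable[[SyllogismItem], bool]) -> float:
    """Fraction of items where the judge's validity call matches the gold label."""
    if not items:
        raise ValueError("items must be non-empty")
    correct = sum(judge(item) == item.is_valid for item in items)
    return correct / len(items)


# Toy items invented for illustration: each conclusion follows from its
# premises (valid) but contradicts everyday knowledge (belief-inconsistent).
TOY_ITEMS = [
    SyllogismItem("All mammals can fly.", "Whales are mammals.",
                  "Whales can fly.", True),
    SyllogismItem("Everything made of metal floats on water.",
                  "Anchors are made of metal.",
                  "Anchors float on water.", True),
]

# Stand-in for world knowledge that a belief-biased reasoner would consult.
IMPLAUSIBLE = {"Whales can fly.", "Anchors float on water."}


def belief_biased_judge(item: SyllogismItem) -> bool:
    """Rejects any conclusion that sounds false, ignoring the premises."""
    return item.conclusion not in IMPLAUSIBLE


def logic_only_judge(item: SyllogismItem) -> bool:
    """Accepts every toy item; valid here only because all toy items are valid."""
    return True


print(accuracy(TOY_ITEMS, belief_biased_judge))
print(accuracy(TOY_ITEMS, logic_only_judge))
```

In a real evaluation the judge would wrap a model call and parse its answer. The point of the sketch is only that a reasoner relying on plausibility instead of the premises is penalized on exactly this kind of item.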
