CL, AI · Jun 8, 2025

BIS Reasoning 1.0: The First Large-Scale Japanese Benchmark for Belief-Inconsistent Syllogistic Reasoning

Ha-Thanh Nguyen, Chaoran Liu, Qianying Liu, Hideyuki Tachibana, Su Myat Noe, Yusuke Miyao, Koichi Takeda, Sadao Kurohashi

arXiv:2506.06955v42.71 citations · h-index: 7

Originality: Incremental advance

AI Analysis

This work addresses the need for reliable LLMs in high-stakes domains such as law and healthcare, where truth must override belief. It is an incremental advance, however, because it builds on prior reasoning benchmarks.

The authors evaluate belief-inconsistent reasoning in large language models (LLMs) by creating BIS Reasoning 1.0, a large-scale Japanese dataset of syllogistic reasoning problems. GPT-4o achieved 79.54% accuracy, and the analysis reveals significant weaknesses in LLMs when handling logically valid but belief-conflicting inputs.

We present BIS Reasoning 1.0, the first large-scale Japanese dataset of syllogistic reasoning problems explicitly designed to evaluate belief-inconsistent reasoning in large language models (LLMs). Unlike prior datasets such as NeuBAROCO and JFLD, which focus on general or belief-aligned reasoning, BIS Reasoning 1.0 introduces logically valid yet belief-inconsistent syllogisms to uncover reasoning biases in LLMs trained on human-aligned corpora. We benchmark state-of-the-art models - including GPT models, Claude models, and leading Japanese LLMs - revealing significant variance in performance, with GPT-4o achieving 79.54% accuracy. Our analysis identifies critical weaknesses in current LLMs when handling logically valid but belief-conflicting inputs. These findings have important implications for deploying LLMs in high-stakes domains such as law, healthcare, and scientific literature, where truth must override intuitive belief to ensure integrity and safety.
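To make "logically valid yet belief-inconsistent" concrete, the following minimal Python sketch scores predictions against such items. It is an illustration under stated assumptions, not the dataset's actual schema or the authors' evaluation code: the `SyllogismItem` fields, the `accuracy` helper, and the English example sentences are all hypothetical (the real dataset is in Japanese).

```python
from dataclasses import dataclass


@dataclass
class SyllogismItem:
    """Hypothetical item layout; the real BIS Reasoning 1.0 schema may differ."""
    premises: tuple[str, str]
    conclusion: str
    is_valid: bool    # gold label: does the conclusion follow from the premises?
    believable: bool  # does the conclusion agree with everyday belief?


def accuracy(items, predictions):
    """Fraction of predictions that match the gold validity label."""
    if not items or len(items) != len(predictions):
        raise ValueError("items and predictions must be non-empty and equal in length")
    correct = sum(pred == item.is_valid for item, pred in zip(items, predictions))
    return correct / len(items)


# Invented examples: each conclusion follows from its premises (valid),
# yet contradicts common knowledge (belief-inconsistent).
items = [
    SyllogismItem(
        premises=("All birds can swim underwater.", "Sparrows are birds."),
        conclusion="Sparrows can swim underwater.",
        is_valid=True,
        believable=False,
    ),
    SyllogismItem(
        premises=("All metals are liquids.", "Iron is a metal."),
        conclusion="Iron is a liquid.",
        is_valid=True,
        believable=False,
    ),
]

# A purely logical reasoner answers with validity.
logical_predictions = [item.is_valid for item in items]
# A belief-biased reasoner answers with believability instead.
biased_predictions = [item.believable for item in items]

logical_score = accuracy(items, logical_predictions)
biased_score = accuracy(items, biased_predictions)
```

On items like these, a reasoner that judges by believability is wrong on every case, which is the kind of bias the benchmark is designed to expose.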
