BaziQA-Benchmark: Evaluating Symbolic and Temporally Compositional Reasoning in Large Language Models
This work provides a new benchmark for evaluating reasoning in LLMs and addresses a gap in the objective assessment of symbolic and temporal tasks. The contribution is incremental, since it builds on existing evaluation frameworks.
The authors introduce BaziQA-Benchmark, a standardized benchmark derived from 200 multiple-choice problems, to evaluate symbolic and temporally compositional reasoning in large language models. They find that models consistently outperform chance but remain far from saturation, with systematic failures on tasks such as precise temporal localization.
We present BaziQA-Benchmark, a standardized benchmark for evaluating symbolic and temporally compositional reasoning in large language models. The benchmark is derived from 200 professionally curated, multiple-choice problems from the Global Fortune-teller Competition (2021–2025), where each instance requires structured inference over a fixed symbolic chart and interacting temporal conditions. Unlike anecdotal or prompt-driven evaluations, BaziQA-Benchmark enables objective scoring and controlled comparison across years, domains, and model families. We evaluate contemporary language models under a multi-turn setting and analyze performance variation across temporal difficulty, reasoning domains, and inference protocols. To further probe reasoning behavior, we introduce a lightweight Structured Reasoning Protocol that constrains inference order without adding domain knowledge. Results show that models consistently outperform chance but remain far from saturation, exhibiting pronounced sensitivity to temporal composition and reasoning order, as well as systematic failures on precise temporal localization and multi-condition symbolic judgments.
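To illustrate what objective scoring against a chance baseline could look like for a multiple-choice benchmark of this kind, the following is a minimal sketch and not the benchmark's actual evaluation code. The record fields (`year`, `gold`, `pred`), the function name `score_by_year`, and the default of four answer options per question are all assumptions made for the example; the abstract does not state the number of options or the data format.

```python
from collections import defaultdict


def score_by_year(records, num_options=4):
    """Compute per-year accuracy and its margin over uniform random guessing.

    `records` is an iterable of dicts with hypothetical keys:
    'year' (int), 'gold' (correct option), 'pred' (model's option).
    `num_options` is an assumed number of answer choices per question.
    """
    totals = defaultdict(lambda: [0, 0])  # year -> [correct, total]
    for r in records:
        bucket = totals[r["year"]]
        bucket[0] += int(r["pred"] == r["gold"])
        bucket[1] += 1

    chance = 1.0 / num_options  # expected accuracy of uniform guessing
    report = {}
    for year, (correct, total) in sorted(totals.items()):
        accuracy = correct / total
        report[year] = {
            "accuracy": accuracy,
            "n": total,
            "above_chance": accuracy - chance,
        }
    return report


# Toy records for illustration only; these are not benchmark data.
records = [
    {"year": 2021, "gold": "A", "pred": "A"},
    {"year": 2021, "gold": "C", "pred": "B"},
    {"year": 2025, "gold": "D", "pred": "D"},
    {"year": 2025, "gold": "B", "pred": "B"},
]
report = score_by_year(records)
for year, stats in report.items():
    print(year, stats)
```

Grouping by a different key, such as a reasoning domain or model family, would follow the same pattern and give the kind of controlled comparison the abstract describes.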