AI | Nov 13, 2025

SPAN: Benchmarking and Improving Cross-Calendar Temporal Reasoning of Large Language Models

arXiv:2511.09993v13.3

Originality: Incremental advance

AI Analysis

This work addresses temporal and cultural adaptability in LLMs, which matters for applications that need accurate date conversion across calendars. The contribution is incremental, since it builds on existing tool-augmented approaches.

The paper introduces the SPAN benchmark for cross-calendar temporal reasoning in large language models (LLMs), shows that state-of-the-art LLMs reach only 34.5% average accuracy on it, and proposes a Time Agent method that reaches 95.31% average accuracy.

We introduce SPAN, a cross-calendar temporal reasoning benchmark, which requires LLMs to perform intra-calendar temporal reasoning and inter-calendar temporal conversion. SPAN features ten cross-calendar temporal reasoning directions, two reasoning types, and two question formats across six calendars. To enable time-variant and contamination-free evaluation, we propose a template-driven protocol for dynamic instance generation that enables assessment on a user-specified Gregorian date. We conduct extensive experiments on both open- and closed-source state-of-the-art (SOTA) LLMs over a range of dates spanning 100 years from 1960 to 2060. Our evaluations show that these LLMs achieve an average accuracy of only 34.5%, with none exceeding 80%, indicating that this task remains challenging. Through in-depth analysis of reasoning types, question formats, and temporal reasoning directions, we identify two key obstacles for LLMs: Future-Date Degradation and Calendar Asymmetry Bias. To strengthen LLMs' cross-calendar temporal reasoning capability, we further develop an LLM-powered Time Agent that leverages tool-augmented code generation. Empirical results show that Time Agent achieves an average accuracy of 95.31%, outperforming several competitive baselines, highlighting the potential of tool-augmented code generation to advance cross-calendar temporal reasoning. We hope this work will inspire further efforts toward more temporally and culturally adaptive LLMs.
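The template-driven generation idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the template wording, the function names, and the choice of the Julian calendar as the target (the abstract does not name the six calendars or the templates SPAN uses). The point is only that a question is instantiated from a user-specified Gregorian date and its ground-truth answer is computed by code rather than recalled, which is also the intuition behind tool-augmented approaches.

```python
from datetime import date, timedelta

# Hypothetical templates; the benchmark's actual wording is not given in the abstract.
TEMPLATES = {
    "conversion": "What is the {target} calendar date for the Gregorian date {g}?",
    "offset": "What is the {target} calendar date {n} days after the Gregorian date {g}?",
}


def gregorian_to_jdn(d: date) -> int:
    """Julian Day Number of a Gregorian date (ordinal 1 is 0001-01-01)."""
    return d.toordinal() + 1721425


def jdn_to_julian(jdn: int) -> tuple:
    """Convert a Julian Day Number to a (year, month, day) Julian calendar date."""
    f = jdn + 1401
    e = 4 * f + 3
    g = (e % 1461) // 4
    h = 5 * g + 2
    day = (h % 153) // 5 + 1
    month = (h // 153 + 2) % 12 + 1
    year = e // 1461 - 4716 + (14 - month) // 12
    return year, month, day


def make_instance(d: date, kind: str, n: int = 0) -> dict:
    """Fill a template for the given Gregorian date and compute the answer by code."""
    shifted = d + timedelta(days=n)  # intra-calendar step (no-op when n == 0)
    y, m, dd = jdn_to_julian(gregorian_to_jdn(shifted))  # inter-calendar step
    question = TEMPLATES[kind].format(target="Julian", g=d.isoformat(), n=n)
    return {"question": question, "answer": f"{y:04d}-{m:02d}-{dd:02d}"}


if __name__ == "__main__":
    print(make_instance(date(2025, 11, 13), "conversion"))
    print(make_instance(date(2025, 11, 13), "offset", n=13))
```

Because each instance is generated from the date passed in, the same templates can be re-evaluated on any date in a chosen range, so answers cannot simply be memorized from a fixed test set.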
