CL AI Jan 12

Measuring Iterative Temporal Reasoning with Time Puzzles

arXiv:2601.07148v21 citations, h-index: 3

AI Analysis

Time Puzzles provides a diagnostic tool for assessing tool-augmented iterative temporal reasoning in LLMs, though the contribution is incremental because it builds on existing evaluation methods.

The authors tackled the problem of evaluating iterative temporal reasoning in large language models by introducing Time Puzzles, a constraint-based date inference task, and found that even the best model (GPT-5) achieved only 49.3% accuracy without tools, with all others below 31%.

We introduce Time Puzzles, a constraint-based date inference task for evaluating iterative temporal reasoning. Each puzzle combines factual temporal anchors with (cross-cultural) calendar relations, admits one or multiple valid solution dates, and is algorithmically generated for controlled, dynamic, and continual evaluation. Across 13 diverse LLMs, Time Puzzles clearly distinguishes their iterative temporal reasoning capabilities and remains challenging without tools: GPT-5 reaches only 49.3% accuracy and all other models stay below 31%, despite the dataset's simplicity. Web search consistently yields substantial gains, while a code interpreter shows mixed effects; however, all models perform much better when constraints are rewritten with explicit dates, revealing a gap in reliable tool use. Overall, Time Puzzles presents a simple, cost-effective diagnostic for tool-augmented iterative temporal reasoning.
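To make "constraint-based date inference" concrete, here is a minimal sketch of how such a puzzle could be checked by brute force. It is not the authors' generator or puzzle format. The `solve` helper, the `Constraint` type, and the example constraints are hypothetical and chosen only for illustration. The puzzle mixes one factual anchor (the year of the Apollo 11 landing) with plain calendar relations and admits more than one valid date.

```python
from datetime import date, timedelta
from typing import Callable, List

# Hypothetical representation: a constraint is any predicate over a date.
Constraint = Callable[[date], bool]


def solve(constraints: List[Constraint], start: date, end: date) -> List[date]:
    """Return every date in [start, end] that satisfies all constraints."""
    solutions = []
    day = start
    while day <= end:
        if all(check(day) for check in constraints):
            solutions.append(day)
        day += timedelta(days=1)
    return solutions


# Factual anchor (assumed for this example): Apollo 11 landed in 1969.
APOLLO_11_YEAR = 1969

constraints: List[Constraint] = [
    lambda d: d.year == APOLLO_11_YEAR,  # same year as the Apollo 11 landing
    lambda d: d.month == 7,              # the month is July
    lambda d: d.weekday() == 6,          # the day is a Sunday (Monday == 0)
    lambda d: d.day > 15,                # falls in the second half of the month
]

solutions = solve(constraints, date(1900, 1, 1), date(2100, 12, 31))
for s in solutions:
    print(s.isoformat())
```

In this sketch the anchor is already resolved to an explicit year, which corresponds to the "rewritten with explicit dates" condition described above. A model working without tools would first have to recall the anchor and then narrow the candidates constraint by constraint.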
