AI · CL · May 19, 2025

TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios

Peking U
arXiv:2505.12891v4 · 9 citations · h-index: 26 · Has Code
Originality: Synthesis-oriented
AI Analysis

TIME provides a standardized evaluation tool for temporal reasoning in LLMs that addresses real-world challenges such as dense temporal information and complex dependencies, but the contribution is incremental because it builds on existing benchmarking efforts.

The authors tackled the lack of benchmarks for temporal reasoning in real-world scenarios by proposing TIME, a multi-level benchmark with 38,522 QA pairs across 3 levels and 11 sub-tasks, and they analyzed performance across diverse scenarios and test-time scaling effects.

Temporal reasoning is pivotal for Large Language Models (LLMs) to comprehend the real world. However, existing works neglect the real-world challenges for temporal reasoning: (1) intensive temporal information, (2) fast-changing event dynamics, and (3) complex temporal dependencies in social interactions. To bridge this gap, we propose TIME, a multi-level benchmark designed for temporal reasoning in real-world scenarios. TIME consists of 38,522 QA pairs, covering 3 levels with 11 fine-grained sub-tasks. The benchmark encompasses 3 sub-datasets reflecting different real-world challenges: TIME-Wiki, TIME-News, and TIME-Dial. We conduct extensive experiments on reasoning models and non-reasoning models, provide an in-depth analysis of temporal reasoning performance across diverse real-world scenarios and tasks, and summarize the impact of test-time scaling on temporal reasoning capabilities. Additionally, we release TIME-Lite, a human-annotated subset to foster future research and standardized evaluation in temporal reasoning. The code is available at https://github.com/sylvain-wei/TIME, the dataset is available at https://huggingface.co/datasets/SylvainWei/TIME, and the project page is at https://sylvain-wei.github.io/TIME/.
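Because the benchmark is organized by sub-dataset, level, and sub-task, results are naturally reported per group rather than as a single score. The following is a minimal sketch of such a breakdown, not the authors' evaluation code: the record fields (`subset`, `level`, `subtask`, `answer`, `prediction`), the placeholder sub-task names, and the normalized exact-match scoring are all assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by_group(records, keys):
    """Return {group_tuple: accuracy}, grouping records by the given keys.

    Scoring is a case-insensitive exact match after trimming whitespace.
    This is an assumed metric, not necessarily the one used by TIME.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        group = tuple(record[key] for key in keys)
        totals[group] += 1
        if record["prediction"].strip().lower() == record["answer"].strip().lower():
            correct[group] += 1
    return {group: correct[group] / totals[group] for group in totals}


# Hypothetical records: field names and sub-task labels are placeholders.
records = [
    {"subset": "TIME-Wiki", "level": 1, "subtask": "subtask_a",
     "answer": "1998", "prediction": "1998"},
    {"subset": "TIME-Wiki", "level": 1, "subtask": "subtask_a",
     "answer": "2001", "prediction": "2003"},
    {"subset": "TIME-News", "level": 2, "subtask": "subtask_b",
     "answer": "before", "prediction": "Before "},
    {"subset": "TIME-Dial", "level": 3, "subtask": "subtask_c",
     "answer": "B", "prediction": "C"},
]

by_level = accuracy_by_group(records, ["level"])
by_subset_and_level = accuracy_by_group(records, ["subset", "level"])

for group, accuracy in sorted(by_level.items()):
    print(f"level {group[0]}: {accuracy:.2f}")
```

Passing a different key list (for example `["subset", "subtask"]`) gives the finer-grained view without changing the scoring logic.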

Code Implementations: 1 repo