CL · Oct 17, 2025

Temporal Referential Consistency: Do LLMs Favor Sequences Over Absolute Time References?

arXiv:2510.15513v1 · 4.91 citations · h-index: 5 · Has Code · EMNLP

Originality: Incremental advance

AI Analysis

This work addresses a critical gap in making LLMs reliable for time-sensitive domains such as law, healthcare, and finance, though the advance is incremental because it builds on existing evaluation and enhancement methods.

The paper tackles temporal consistency in large language models (LLMs) by introducing a benchmark, temporal referential consistency, and an accompanying resource, TEMP-ReCon, used to evaluate LLMs in English, French, and Romanian. It finds that LLMs exhibit insufficient temporal referential consistency. It then proposes UnTRaP, a reasoning path alignment-based model, which the authors report improves this consistency over several baseline models.

The increasing acceptance of large language models (LLMs) as an alternative to knowledge sources marks a significant paradigm shift across various domains, including time-sensitive fields such as law, healthcare, and finance. To fulfill this expanded role, LLMs must not only be factually accurate but also demonstrate consistency across temporal dimensions, which requires robust temporal reasoning capabilities. Despite this critical requirement, efforts to ensure temporal consistency in LLMs remain scarce, including a noticeable absence of work aimed at evaluating or augmenting LLMs across temporal references in time-sensitive inquiries. In this paper, we seek to address this gap by introducing a novel benchmark entitled temporal referential consistency, accompanied by a resource, TEMP-ReCon, designed to benchmark a wide range of both open-source and closed-source LLMs across linguistic contexts of differing resource richness (including English, French, and Romanian). The findings emphasize that LLMs exhibit insufficient temporal referential consistency. To address this, we propose UnTRaP, a reasoning path alignment-based model that aims to enhance the temporal referential consistency of LLMs. Our empirical experiments substantiate the efficacy of UnTRaP compared to several baseline models.
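
The abstract does not spell out how temporal referential consistency is scored. As a rough illustration only, one way to operationalize the idea is to ask the same question through two temporal references (for example, a calendar year versus an ordinal position in a sequence) and measure how often the two answers agree. The sketch below assumes that reading; the `ReferencePair` schema, the `normalize` helper, and the exact-match agreement rule are hypothetical and are not taken from TEMP-ReCon or UnTRaP.

```python
from dataclasses import dataclass


@dataclass
class ReferencePair:
    """One fact queried through two temporal references (hypothetical schema)."""
    absolute_answer: str  # model answer when the query names a year or date
    sequence_answer: str  # model answer when the query uses a position in a sequence


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count."""
    return " ".join(answer.lower().split())


def consistency_rate(pairs: list[ReferencePair]) -> float:
    """Fraction of pairs whose two answers match after normalization (assumed metric)."""
    if not pairs:
        return 0.0
    agree = sum(
        normalize(p.absolute_answer) == normalize(p.sequence_answer) for p in pairs
    )
    return agree / len(pairs)


# Placeholder strings stand in for model outputs; they are not real benchmark data.
pairs = [
    ReferencePair("Model answer A", "model  answer a"),  # agree after normalization
    ReferencePair("Model answer A", "Model answer B"),   # disagree
]
rate = consistency_rate(pairs)
```

Exact string matching is the simplest possible agreement rule; a real evaluation would likely need entity-level matching, and multilingual settings would need language-aware normalization.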
