CL, AI | Sep 20, 2024

Time Awareness in Large Language Models: Benchmarking Fact Recall Across Time

arXiv:2409.13338v3 | 8 citations | h-index: 5
AI Analysis

The paper addresses temporal reasoning in LLMs for real-world applications, but its contribution is incremental: it benchmarks models and identifies limitations rather than proposing a new solution.

The paper tackles the lack of time awareness in LLM fact recall by introducing a framework and a dataset of over 8,000 events from 2018 to 2024. It finds that base models often outperform instruction-tuned ones on time-sensitive tasks, and that models are brittle when facts are paraphrased.

Who is the US President? The answer changes depending on when the question is asked. While large language models (LLMs) are evaluated on various reasoning tasks, they often miss a crucial dimension: time. In real-world scenarios, the correctness of answers is frequently tied to temporal context. To address this gap, we present a novel framework and dataset spanning over 8,000 events from 2018 to 2024, annotated with day-level granularity and sourced globally across domains such as politics, science, and business. Our TimeShift evaluation method systematically probes LLMs for temporal reasoning, revealing that base models often outperform instruction-tuned and synthetic-trained counterparts on time-sensitive recall. Additionally, we find that even large-scale models exhibit brittleness in handling paraphrased facts, highlighting unresolved challenges in temporal consistency. By identifying these limitations, our work provides a significant step toward advancing time-aware language models capable of adapting to the dynamic nature of real-world knowledge.
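The opening question can be made concrete with a small sketch of date-conditioned fact checking. This is not the paper's TimeShift method or its dataset format. The timeline, the function names `answer_as_of` and `is_correct`, and the exact-match scoring are all assumptions chosen for illustration.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical toy timeline of (start date, office holder), sorted by start date.
# Illustrative only; not drawn from the paper's dataset.
US_PRESIDENT_TIMELINE = [
    (date(2017, 1, 20), "Donald Trump"),
    (date(2021, 1, 20), "Joe Biden"),
]


def answer_as_of(timeline, when):
    """Return the value valid on `when`, or None if `when` precedes the timeline."""
    starts = [start for start, _ in timeline]
    # Index of the last entry whose start date is on or before `when`.
    idx = bisect_right(starts, when) - 1
    return timeline[idx][1] if idx >= 0 else None


def is_correct(model_answer, timeline, when):
    """Case-insensitive exact match against the answer valid on `when` (assumed scoring rule)."""
    expected = answer_as_of(timeline, when)
    return expected is not None and model_answer.strip().lower() == expected.lower()
```

Under these assumptions, the same model answer can be right for one date and wrong for another: `is_correct("Joe Biden", US_PRESIDENT_TIMELINE, date(2023, 3, 1))` holds, while the same answer checked against `date(2019, 6, 1)` does not.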
