CL | AI | Jan 13, 2025

TiEBe: Tracking Language Model Recall of Notable Worldwide Events Through Time

Thales Sales Almeida, Giovana Kerche Bonás, João Guilherme Alves Santos, Hugo Abonizio, Rodrigo Nogueira

arXiv:2501.07482v2 | 15.58 citations | h-index: 9 | Has Code

Originality: Incremental advance

AI Analysis

The work addresses the need for more balanced global representation in LLM training. The advance is incremental because it builds on existing factual recall benchmarks.

The authors assess how well large language models (LLMs) retain knowledge of global events over time and across regions. They introduce the Timely Events Benchmark (TiEBe), a dataset of over 23,000 question-answer pairs spanning more than 10 years, 23 regions, and 13 languages. They find significant geographic disparities in recall and a Pearson correlation above 0.7 between model performance and countries' socioeconomic indicators such as HDI.

As the knowledge landscape evolves and large language models (LLMs) become increasingly widespread, there is a growing need to keep these models updated with current events. While existing benchmarks assess general factual recall, few studies explore how LLMs retain knowledge over time or across different regions. To address these gaps, we present the Timely Events Benchmark (TiEBe), a dataset of over 23,000 question-answer pairs centered on notable global and regional events, spanning more than 10 years of events, 23 regions, and 13 languages. TiEBe leverages structured retrospective data from Wikipedia to identify notable events through time. These events are then used to construct a benchmark to evaluate LLMs' understanding of global and regional developments, grounded in factual evidence beyond Wikipedia itself. Our results reveal significant geographic disparities in factual recall, emphasizing the need for more balanced global representation in LLM training. We also observe a Pearson correlation of more than 0.7 between models' performance in TiEBe and various countries' socioeconomic indicators, such as HDI. In addition, we examine the impact of language on factual recall by posing questions in the native language of the region where each event occurred, uncovering substantial performance gaps for low-resource languages.
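The correlation analysis mentioned in the abstract can be illustrated with a minimal sketch. It assumes per-question results are available as (region, is_correct) pairs and that one socioeconomic indicator value exists per region. The function names, region labels, and all numbers are hypothetical placeholders chosen for illustration; they are not the authors' code or results.

```python
from collections import defaultdict
from math import sqrt


def accuracy_by_region(records):
    """Aggregate (region, is_correct) pairs into {region: accuracy}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for region, is_correct in records:
        totals[region] += 1
        hits[region] += int(is_correct)
    return {region: hits[region] / totals[region] for region in totals}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Placeholder data only: regions "A", "B", "C" and every value below are
# invented for illustration and do not come from the paper.
records = [
    ("A", True), ("A", True), ("A", False), ("A", True),
    ("B", True), ("B", False),
    ("C", False), ("C", False), ("C", True), ("C", False),
]
indicator = {"A": 0.9, "B": 0.7, "C": 0.5}  # e.g. an HDI-like score per region

acc = accuracy_by_region(records)
regions = sorted(acc)
r = pearson([acc[k] for k in regions], [indicator[k] for k in regions])
print(f"Pearson r between accuracy and indicator: {r:.3f}")
```

With real benchmark output, `records` would hold one entry per evaluated question and `indicator` would hold the chosen socioeconomic measure for each region; the placeholder data here is deliberately linear, so it says nothing about the reported correlation.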

