CL | Sep 16, 2025

HistoryBankQA: Multilingual Temporal Question Answering on Historical Events

arXiv:2509.12720v1 | h-index: 3 | Has Code
Originality: Incremental advance
AI Analysis

HistoryBankQA provides a comprehensive resource for advancing multilingual temporal understanding of historical events in NLP, though the contribution is incremental because it builds on existing temporal reasoning work.

The authors address the lack of multilingual temporal reasoning benchmarks for historical events by creating HistoryBank, a database of 10M+ events extracted from Wikipedia in 10 languages, and a QA benchmark covering 6 tasks. In their evaluation, GPT4o performed best and Gemma-2 outperformed the other small models.

Temporal reasoning about historical events is a critical skill for NLP tasks like event extraction, historical entity linking, temporal question answering, timeline summarization, temporal event clustering, and temporal natural language inference. Yet efforts on benchmarking temporal reasoning capabilities of large language models (LLMs) are rather limited. Existing temporal reasoning datasets are limited in scale, lack multilingual coverage, and focus more on contemporary events. To address these limitations, we present HistoryBank, a multilingual database of 10M+ historical events extracted from Wikipedia timeline pages and article infoboxes. Our database provides unprecedented coverage in both historical depth and linguistic breadth with 10 languages. Additionally, we construct a comprehensive question answering benchmark for temporal reasoning across all languages. This benchmark covers a diverse set of 6 temporal QA reasoning tasks, and we evaluate a suite of popular language models (LLaMA-3-8B, Mistral-7B, Gemma-2-9b, Qwen3-8B, GPT4o) to assess their performance on these tasks. As expected, GPT4o performs best across all answer types and languages; Gemma-2 outperforms the other small language models. Our work aims to provide a comprehensive resource for advancing multilingual and temporally aware natural language understanding of historical events. To facilitate further research, we will make our code and datasets publicly available upon acceptance of this paper.
