CL · Jun 13, 2024

Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning

arXiv:2406.09170v1 · 91 citations · Has Code
Originality: Synthesis-oriented
AI Analysis

This work gives researchers a benchmark for assessing LLM temporal reasoning. The contribution is incremental: it pairs existing evaluation methods with new synthetic datasets.

The authors evaluate LLMs on temporal reasoning using synthetic datasets, which avoid both pre-training contamination and the factual inconsistencies that anonymization can introduce. The datasets let them study systematically how problem structure, size, question type, and fact order affect LLM performance.

Large language models (LLMs) have showcased remarkable reasoning capabilities, yet they remain susceptible to errors, particularly in temporal reasoning tasks involving complex temporal logic. Existing research has explored LLM performance on temporal reasoning using diverse datasets and benchmarks. However, these studies often rely on real-world data that LLMs may have encountered during pre-training or employ anonymization techniques that can inadvertently introduce factual inconsistencies. In this work, we address these limitations by introducing novel synthetic datasets specifically designed to assess LLM temporal reasoning abilities in various scenarios. The diversity of question types across these datasets enables systematic investigation into the impact of the problem structure, size, question type, fact order, and other factors on LLM performance. Our findings provide valuable insights into the strengths and weaknesses of current LLMs in temporal reasoning tasks. To foster further research in this area, we are open-sourcing the datasets and evaluation framework used in our experiments: https://huggingface.co/datasets/baharef/ToT.

