CL AI · Dec 17, 2024

DateLogicQA: Benchmarking Temporal Biases in Large Language Models

Gagan Bhatia, MingZe Tang, Cristina Mahanta, Madiha Kazi

arXiv:2412.13377v29.111 citations · h-index: 7 · Has Code · NAACL

Originality: Synthesis-oriented

AI Analysis

This work addresses temporal biases in LLMs, a problem relevant to researchers and developers, but it is incremental: it focuses on benchmarking and analysis and does not propose a new method.

The paper tackles the problem of temporal reasoning in large language models by introducing DateLogicQA, a benchmark with 190 questions, and finds that models exhibit biases at representation and logical levels, highlighting challenges in handling temporal data accurately.

This paper introduces DateLogicQA, a benchmark with 190 questions covering diverse date formats, temporal contexts, and reasoning types. We propose the Semantic Integrity Metric to assess tokenization quality and analyse two biases: Representation-Level Bias, affecting embeddings, and Logical-Level Bias, influencing reasoning outputs. Our findings provide a comprehensive evaluation of LLMs' capabilities and limitations in temporal reasoning, highlighting key challenges in handling temporal data accurately.
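To make "diverse date formats" concrete, here is a minimal Python sketch that renders one date in several numeric layouts and shows how two different dates can collapse to the same string. The `FORMATS` table and the helper `render_variants` are hypothetical and chosen for illustration only; they are not the benchmark's actual format list, and the sketch does not implement the Semantic Integrity Metric.

```python
from datetime import date

# Hypothetical format table for illustration; not taken from DateLogicQA.
FORMATS = {
    "YYYY-MM-DD": "%Y-%m-%d",
    "DD/MM/YYYY": "%d/%m/%Y",
    "MM/DD/YYYY": "%m/%d/%Y",
    "DDMMYYYY": "%d%m%Y",
    "YYYYMMDD": "%Y%m%d",
}


def render_variants(d: date) -> dict[str, str]:
    """Render a single date in every layout listed in FORMATS."""
    return {name: d.strftime(fmt) for name, fmt in FORMATS.items()}


# One date, several surface forms.
variants = render_variants(date(2024, 12, 17))
for name, text in variants.items():
    print(f"{name:<12}{text}")

# Two different dates that produce the same string under different layouts:
# 4 March 2024 as DD/MM/YYYY and 3 April 2024 as MM/DD/YYYY.
day_first = render_variants(date(2024, 3, 4))["DD/MM/YYYY"]
month_first = render_variants(date(2024, 4, 3))["MM/DD/YYYY"]
print(day_first == month_first)
```

The final comparison illustrates why the format itself has to be treated as part of the question: without knowing the layout, the string alone does not identify a date.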

