CL · LG · Sep 21, 2024

Temporally Consistent Factuality Probing for Large Language Models

arXiv:2409.14065v2 · 25 citations · h-index: 11
Originality: Incremental advance
AI Analysis

This work addresses the need for LLMs to remain factually consistent over time, which is crucial for their reliable use as knowledge sources. It represents an incremental improvement in evaluation and training methods.

The study tackles the problem of evaluating and improving the temporally consistent factuality of Large Language Models (LLMs) by introducing TeCFaP, a novel probing task. Most of the LLMs tested performed poorly on it, and the proposed CoTSeLF solution showed efficacy over several baselines.

The prolific use of Large Language Models (LLMs) as an alternative knowledge base requires them to be factually consistent, which means being both correct and consistent across paraphrased queries. Recently, significant attempts have been made to build benchmark datasets and metrics that evaluate LLMs for these traits. However, the structural simplicity (subject-relation-object) and contemporary association of their query formulation limit the broader definition of factuality and consistency. In this study, we introduce TeCFaP, a novel Temporally Consistent Factuality Probe task that expands the consistent factuality probe into the temporal dimension. To this end, we propose TEMP-COFAC, a high-quality dataset of prefix-style English query paraphrases. Subsequently, we extend the definitions of existing metrics to represent consistent factuality across the temporal dimension. We experiment with a diverse set of LLMs and find that most of them perform poorly on TeCFaP. Next, we propose a novel solution, CoTSeLF (Consistent-Time-Sensitive Learning Framework), which combines multi-task instruction tuning (MT-IT) with consistent-time-sensitive reinforcement learning (CTSRL) to improve temporally consistent factuality in LLMs. Our experiments demonstrate the efficacy of CoTSeLF over several baselines.
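
To make the notions of correctness and consistency across paraphrased queries concrete, the following is a minimal sketch of three generic scores computed over a model's answers to paraphrases of one query. It is not the paper's TeCFaP metric definitions, which the abstract does not spell out. The function names and the toy answers are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def accuracy(predictions, gold):
    """Fraction of paraphrase predictions that match the gold answer."""
    if not predictions:
        return 0.0
    return sum(p == gold for p in predictions) / len(predictions)


def pairwise_consistency(predictions):
    """Fraction of unordered prediction pairs that agree with each other.

    This ignores the gold answer: a model can be consistent yet wrong.
    """
    pairs = list(combinations(predictions, 2))
    if not pairs:
        return 1.0  # a single prediction is trivially self-consistent
    return sum(a == b for a, b in pairs) / len(pairs)


def consistent_accuracy(predictions, gold):
    """1.0 only if every paraphrase yields the gold answer, else 0.0."""
    return float(bool(predictions) and all(p == gold for p in predictions))


# Hypothetical answers from one model to three paraphrases of a single query.
toy_predictions = ["1969", "1969", "1972"]
toy_gold = "1969"

print(accuracy(toy_predictions, toy_gold))
print(pairwise_consistency(toy_predictions))
print(consistent_accuracy(toy_predictions, toy_gold))
```

Keeping the three scores separate shows why correctness alone is insufficient: a model can answer most paraphrases correctly while still disagreeing with itself across them.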

Code Implementations: 1 repo
