AI | Mar 17

V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge in Vision Language Models

Seyed Mahed Mousavi, Christian Moiola, Massimo Rizzoli, Simone Alghisi, Giuseppe Riccardi

arXiv:2603.16581 | 83.1 | h-index: 9 | Has Code

AI Analysis

This paper addresses time-sensitive knowledge updates in VLMs, a problem relevant to AI researchers and developers. The contribution is incremental: it focuses on benchmarking and analysis rather than proposing a new solution.

The authors address outdated factual knowledge in Vision-Language Models (VLMs), which stems from static training data, by introducing V-DyKnow, a benchmark for evaluating time-sensitive knowledge. They find that VLMs frequently output outdated facts, that reliability degrades from textual to visual stimuli, and that existing alignment methods fail to update knowledge consistently across modalities.

Vision-Language Models (VLMs) are trained on data snapshots of documents, including images and texts. Their training data and evaluation benchmarks are typically static, implicitly treating factual knowledge as time-invariant. However, real-world facts are intrinsically time-sensitive and subject to erratic and periodic changes, causing model predictions to become outdated. We present V-DyKnow, a Visual Dynamic Knowledge benchmark for evaluating time-sensitive factual knowledge in VLMs. Using V-DyKnow, we benchmark closed- and open-source VLMs and analyze a) the reliability (correctness and consistency) of model responses across modalities and input perturbations; b) the efficacy of knowledge editing and multi-modal RAG methods for knowledge updates across modalities; and c) the sources of outdated predictions, through data and mechanistic analysis. Our results show that VLMs frequently output outdated facts, reflecting outdated snapshots used in the (pre-)training phase. Factual reliability degrades from textual to visual stimuli, even when entities are correctly recognized. Besides, existing alignment approaches fail to consistently update the models' knowledge across modalities. Together, these findings highlight fundamental limitations in how current VLMs acquire and update time-sensitive knowledge across modalities. We release the benchmark, code, and evaluation data.
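The reliability analysis described above (correctness and consistency across modalities) can be illustrated with a minimal sketch. This is not the benchmark's actual scoring code: the function names `classify` and `reliability`, the three labels, the substring matching rule, and the example entity names are all assumptions made for illustration.

```python
from collections import Counter


def classify(answer: str, current: str, outdated: set[str]) -> str:
    """Label an answer as 'correct', 'outdated', or 'irrelevant'.

    Hypothetical rule: case-insensitive substring match against the
    current fact first, then against any known outdated fact.
    """
    text = answer.strip().lower()
    if current.lower() in text:
        return "correct"
    if any(old.lower() in text for old in outdated):
        return "outdated"
    return "irrelevant"


def reliability(answers_by_modality: dict[str, str],
                current: str,
                outdated: set[str]) -> dict:
    """Score the answers to one question asked through several modalities."""
    if not answers_by_modality:
        raise ValueError("need at least one answer")
    labels = {m: classify(a, current, outdated)
              for m, a in answers_by_modality.items()}
    counts = Counter(labels.values())
    return {
        "labels": labels,
        # Fraction of modalities whose answer matches the current fact.
        "correct_rate": counts["correct"] / len(labels),
        # Consistent means every modality received the same label.
        "consistent": len(set(labels.values())) == 1,
    }


# Made-up example: the same question posed as a text prompt and as an image prompt.
example = reliability(
    {"text": "The CEO is Alice Example.",
     "image": "I believe it is Bob Sample."},
    current="Alice Example",
    outdated={"Bob Sample"},
)
```

In this made-up case the text prompt yields the current fact while the image prompt yields the outdated one, so the pair is scored as half correct and inconsistent. That is the kind of text-to-visual gap the abstract reports.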
