CL · Aug 7, 2023

WIKITIDE: A Wikipedia-Based Timestamped Definition Pairs Dataset

arXiv:2308.03582v2133 citations · h-index: 25
Originality: Synthesis-oriented
AI Analysis

WikiTiDe provides a resource for diachronic NLP that helps models detect updates to concepts, events, or entities, but the contribution is incremental because it builds on existing dataset creation methods.

The authors address the challenge of identifying changes in language or world knowledge for NLP models by creating WikiTiDe, a dataset of timestamped definition pairs from Wikipedia. Bootstrapping the seed version of the dataset led to better fine-tuned models, and those models showed promising results in downstream tasks.

A fundamental challenge in the current NLP context, dominated by language models, comes from the inflexibility of current architectures to 'learn' new information. While model-centric solutions like continual learning or parameter-efficient fine-tuning are available, the question remains of how to reliably identify changes in language or in the world. In this paper, we propose WikiTiDe, a dataset derived from pairs of timestamped definitions extracted from Wikipedia. We argue that such a resource can help accelerate diachronic NLP, specifically by training models able to scan knowledge resources for core updates concerning a concept, an event, or a named entity. Our proposed end-to-end method is fully automatic and leverages a bootstrapping algorithm for gradually creating a high-quality dataset. Our results suggest that bootstrapping the seed version of WikiTiDe leads to better fine-tuned models. We also leverage fine-tuned models in a number of downstream tasks, showing promising results with respect to competitive baselines.
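To make the idea of timestamped definition pairs and bootstrapping more concrete, here is a minimal sketch of a generic self-training loop. It is not the authors' algorithm: the `DefinitionPair` fields, the `bootstrap` function, the confidence threshold, and the toy `train_overlap_scorer` (a word-overlap stand-in for a fine-tuned model that ignores its training data) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class DefinitionPair:
    """Hypothetical record: two definitions of one term at two points in time."""
    term: str
    old_definition: str
    old_timestamp: str  # ISO date string, e.g. "2020-01-01"
    new_definition: str
    new_timestamp: str


Labeled = Tuple[DefinitionPair, int]        # label 1 = core update, 0 = no core update
Scorer = Callable[[DefinitionPair], float]  # estimated probability of a core update


def train_overlap_scorer(labeled: List[Labeled]) -> Scorer:
    """Toy stand-in for model fine-tuning (assumption): ignores `labeled` and
    scores a pair by 1 minus the Jaccard overlap of its word sets."""
    def score(pair: DefinitionPair) -> float:
        old = set(pair.old_definition.lower().split())
        new = set(pair.new_definition.lower().split())
        union = old | new
        if not union:
            return 0.0
        return 1.0 - len(old & new) / len(union)
    return score


def bootstrap(
    seed: List[Labeled],
    unlabeled: List[DefinitionPair],
    train: Callable[[List[Labeled]], Scorer],
    rounds: int = 3,
    threshold: float = 0.9,
) -> Tuple[List[Labeled], List[DefinitionPair]]:
    """Grow the labeled set by repeatedly adding confidently scored pairs."""
    labeled = list(seed)
    pool = list(unlabeled)
    for _ in range(rounds):
        scorer = train(labeled)  # retrain on the current labeled set
        remaining = []
        added = 0
        for pair in pool:
            p = scorer(pair)
            if p >= threshold:            # confident: core update
                labeled.append((pair, 1))
                added += 1
            elif p <= 1.0 - threshold:    # confident: no core update
                labeled.append((pair, 0))
                added += 1
            else:                         # uncertain: keep for a later round
                remaining.append(pair)
        pool = remaining
        if added == 0:                    # nothing new was learned, stop early
            break
    return labeled, pool
```

In a realistic setting, `train` would fine-tune a classifier on the labeled pairs, so that each round's scorer differs from the previous one. The loop structure stays the same: only confidently scored pairs are promoted into the dataset, and uncertain ones wait for a stronger model.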

