CL | Feb 16

The Wikidata Query Logs Dataset

arXiv:2602.14594v11.11 citations | h-index: 23

Originality: Synthesis-oriented

AI Analysis

The dataset is a valuable resource for researchers in knowledge graph question answering, though the contribution is incremental because it builds on existing dataset formats.

The authors address the lack of large-scale, real-world datasets for Wikidata question answering by constructing the Wikidata Query Logs (WDQL) dataset of 200k question-query pairs, over 6x larger than the largest existing Wikidata datasets of similar format, and they demonstrate its utility for training question-answering methods.

We present the Wikidata Query Logs (WDQL) dataset, a dataset consisting of 200k question-query pairs over the Wikidata knowledge graph. It is over 6x larger than the largest existing Wikidata datasets of similar format without relying on template-generated queries. Instead, we construct it using real-world SPARQL queries sent to the Wikidata Query Service and generate questions for them. Since these log-based queries are anonymized, and therefore often do not produce results, a significant amount of effort is needed to convert them back into meaningful SPARQL queries. To achieve this, we present an agent-based method that iteratively de-anonymizes, cleans, and verifies queries against Wikidata while also generating corresponding natural-language questions. We demonstrate the dataset's benefit for training question-answering methods. All WDQL assets, as well as the agent code, are publicly available under a permissive license.
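The verification step mentioned in the abstract, checking that a repaired query actually produces results against Wikidata, can be illustrated with a minimal sketch. This is not the authors' agent code. It only assumes the public Wikidata Query Service endpoint and the standard SPARQL JSON results format. The helper names (`build_request`, `has_results`, `verify`) and the User-Agent string are hypothetical, chosen for illustration.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata Query Service SPARQL endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_request(sparql: str) -> urllib.request.Request:
    """Build a GET request asking the endpoint for SPARQL JSON results."""
    params = urllib.parse.urlencode({"query": sparql, "format": "json"})
    headers = {
        "Accept": "application/sparql-results+json",
        # Hypothetical identifier; use a descriptive User-Agent of your own.
        "User-Agent": "wdql-example/0.1 (illustrative sketch)",
    }
    return urllib.request.Request(WDQS_ENDPOINT + "?" + params, headers=headers)


def has_results(payload: dict) -> bool:
    """Return True for a true ASK answer or a SELECT with at least one binding."""
    if "boolean" in payload:
        return bool(payload["boolean"])
    return len(payload.get("results", {}).get("bindings", [])) > 0


def verify(sparql: str, timeout: float = 30.0) -> bool:
    """Run the query against the live endpoint (requires network access)."""
    with urllib.request.urlopen(build_request(sparql), timeout=timeout) as resp:
        return has_results(json.load(resp))
```

A candidate query would be kept only if `verify(query)` returns `True`. An anonymized query that no longer matches anything in Wikidata comes back with an empty `bindings` list and would be sent back for another repair round.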
