RuBQ: A Russian Dataset for Question Answering over Wikidata
RuBQ provides a useful resource for Russian NLP researchers, but the contribution is incremental: it adapts existing dataset creation methods to a new language.
The authors address the lack of a Russian dataset for knowledge base question answering by creating RuBQ, which comprises 1,500 Russian questions with English machine translations, SPARQL queries to Wikidata, and verified reference answers.
The paper presents RuBQ, the first Russian knowledge base question answering (KBQA) dataset. The high-quality dataset consists of 1,500 Russian questions of varying complexity, their English machine translations, SPARQL queries to Wikidata, reference answers, as well as a Wikidata sample of triples containing entities with Russian labels. The dataset creation started with a large collection of question-answer pairs from online quizzes. The data underwent automatic filtering, crowd-assisted entity linking, automatic generation of SPARQL queries, and their subsequent in-house verification.
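To make the described record structure concrete, here is a minimal Python sketch of what a single entry might look like: a Russian question, its English translation, a SPARQL query, and reference answers. The class name `KBQARecord`, its field names, the helper `one_hop_query`, and the example question are all assumptions made for illustration; they are not taken from the RuBQ release and do not reflect its actual schema. The identifiers Q159 (Russia), P36 (capital), and Q649 (Moscow) are real Wikidata IDs, and `wd:`/`wdt:` are the standard Wikidata prefixes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class KBQARecord:
    """Hypothetical record layout; field names are illustrative, not RuBQ's schema."""
    question_ru: str          # original Russian question
    question_en: str          # English translation
    sparql: str               # query against Wikidata
    answers: List[str] = field(default_factory=list)  # reference answer entity IDs


def one_hop_query(entity_id: str, property_id: str) -> str:
    """Build a simple one-hop SPARQL query (hypothetical helper)."""
    return f"SELECT ?answer WHERE {{ wd:{entity_id} wdt:{property_id} ?answer }}"


# Illustrative example only; this question is not claimed to be in the dataset.
record = KBQARecord(
    question_ru="Какой город является столицей России?",
    question_en="Which city is the capital of Russia?",
    sparql=one_hop_query("Q159", "P36"),
    answers=["Q649"],
)
print(record.sparql)
```

Questions of higher complexity would need richer query templates than this single triple pattern; the sketch covers only the simplest one-hop case.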