CL · AI · DB · Nov 11, 2021

A Chinese Multi-type Complex Questions Answering Dataset over Wikidata

arXiv:2111.06086v1 · 11 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses a gap in KBQA research by providing resources for Chinese and for complex question types. It is incremental, however, in that it extends existing dataset efforts to a new language and knowledge base.

The authors address the lack of large-scale complex Chinese datasets for knowledge base question answering (KBQA) over Wikidata by introducing CLC-QuAD, a dataset of multi-type complex questions, together with a baseline model that answers these questions using Wikidata as background knowledge.

Complex Knowledge Base Question Answering (KBQA) has been a popular area of research over the past decade. Recent public datasets have led to encouraging results in this field, but they are mostly limited to English and involve only a small number of question types and relations, which hinders research in more realistic settings and in languages other than English. In addition, few state-of-the-art KBQA models are trained on Wikidata, one of the most popular real-world knowledge bases. To address these challenges, we propose CLC-QuAD, the first large-scale complex Chinese semantic parsing dataset over Wikidata. Together with the dataset, we present a text-to-SPARQL baseline model that can effectively answer multi-type complex questions, such as factual questions, dual intent questions, boolean questions, and counting questions, with Wikidata as the background knowledge. Finally, we analyze the performance of state-of-the-art KBQA models on this dataset and identify the challenges facing Chinese KBQA.
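
To make the four question types named above more concrete, the following is a minimal sketch of the kind of SPARQL a text-to-SPARQL system might target for each of them. It is not taken from CLC-QuAD or from the authors' baseline: the function `build_query`, the type labels, and the query shapes are hypothetical, and "dual intent" is read here as a question asking for two pieces of information at once, which is an assumption. The Wikidata identifiers are real (Q148 is the People's Republic of China, Q956 is Beijing, P36 is "capital", P1082 is "population", P47 is "shares border with").

```python
# Illustrative sketch only: these templates are assumptions, not the
# query forms actually used in CLC-QuAD or its baseline model.
PREFIXES = (
    "PREFIX wd: <http://www.wikidata.org/entity/>\n"
    "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
)


def build_query(question_type, subject, properties, obj=None):
    """Build a SPARQL string for one of four hypothetical question types."""
    if question_type == "boolean":
        # Yes/no question: ASK whether a single triple exists.
        if obj is None or len(properties) != 1:
            raise ValueError("boolean needs one property and an object")
        body = f"ASK {{ wd:{subject} wdt:{properties[0]} wd:{obj} }}"
    elif question_type in ("factual", "dual_intent", "counting"):
        expected = 2 if question_type == "dual_intent" else 1
        if len(properties) != expected:
            raise ValueError(f"{question_type} needs {expected} property(ies)")
        # One triple pattern per requested property, bound to ?x0, ?x1, ...
        patterns = " . ".join(
            f"wd:{subject} wdt:{p} ?x{i}" for i, p in enumerate(properties)
        )
        if question_type == "counting":
            head = "SELECT (COUNT(DISTINCT ?x0) AS ?count)"
        else:
            head = "SELECT " + " ".join(f"?x{i}" for i in range(len(properties)))
        body = f"{head} WHERE {{ {patterns} }}"
    else:
        raise ValueError(f"unknown question type: {question_type}")
    return PREFIXES + body


# Factual: "What is the capital of China?"
print(build_query("factual", "Q148", ["P36"]))
# Dual intent: "What are the capital and the population of China?"
print(build_query("dual_intent", "Q148", ["P36", "P1082"]))
# Boolean: "Is Beijing the capital of China?"
print(build_query("boolean", "Q148", ["P36"], obj="Q956"))
# Counting: "How many countries share a border with China?"
print(build_query("counting", "Q148", ["P47"]))
```

The sketch only shows how the answer form changes with the question type (a variable binding, two bindings, a truth value, or a count). A real system would also have to link the Chinese question text to the right entities and relations, which this example skips.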

