CL AI DB
Nov 11, 2021

A Chinese Multi-type Complex Questions Answering Dataset over Wikidata

Jianyun Zou, Min Yang, Lichao Zhang, Yechen Xu, Qifan Pan, Fengqing Jiang, Ran Qin, Shushu Wang, Yifan He, Songfang Huang, Zhou Zhao

arXiv:2111.06086v11.411 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses a gap in KBQA research by providing resources for the Chinese language and for complex question types. The contribution is incremental, since it extends existing dataset efforts to a new language and knowledge base.

The authors tackle the lack of large-scale complex Chinese datasets for knowledge base question answering (KBQA) over Wikidata by introducing CLC-QuAD, a dataset of multi-type complex questions, and by developing a baseline model that answers these questions using Wikidata as background knowledge.

Complex Knowledge Base Question Answering has been a popular area of research over the past decade. Recent public datasets have led to encouraging results in this field, but they are mostly limited to English and involve only a small number of question types and relations, which hinders research in more realistic settings and in languages other than English. In addition, few state-of-the-art KBQA models are trained on Wikidata, one of the most popular real-world knowledge bases. We propose CLC-QuAD, the first large-scale complex Chinese semantic parsing dataset over Wikidata, to address these challenges. Together with the dataset, we present a text-to-SPARQL baseline model, which can effectively answer multi-type complex questions, such as factual questions, dual intent questions, boolean questions, and counting questions, with Wikidata as the background knowledge. Finally, we analyze the performance of SOTA KBQA models on this dataset and identify the challenges facing Chinese KBQA.
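
To make the four question types named in the abstract more concrete, the following is a minimal sketch of the kind of SPARQL a text-to-SPARQL system might target for each one. It is not taken from CLC-QuAD or from the authors' baseline: the helper names (`factual`, `boolean`, `counting`, `dual_intent`) are hypothetical, the query shapes are assumptions, and "dual intent" is read here as a question that asks for two properties of one entity, which may differ from the dataset's own definition. The code only builds query strings and does not contact any endpoint.

```python
# Illustrative only: hypothetical helpers that build SPARQL strings over
# Wikidata for the four question types listed in the abstract.

PREFIXES = (
    "PREFIX wd: <http://www.wikidata.org/entity/>\n"
    "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
)


def factual(subject: str, prop: str) -> str:
    """Factual question: return the value(s) of one property of an entity."""
    return f"{PREFIXES}SELECT ?answer WHERE {{ wd:{subject} wdt:{prop} ?answer . }}"


def boolean(subject: str, prop: str, obj: str) -> str:
    """Boolean question: ASK whether a given triple holds."""
    return f"{PREFIXES}ASK WHERE {{ wd:{subject} wdt:{prop} wd:{obj} . }}"


def counting(subject: str, prop: str) -> str:
    """Counting question: count the values of one property of an entity."""
    return (
        f"{PREFIXES}SELECT (COUNT(?x) AS ?count) "
        f"WHERE {{ wd:{subject} wdt:{prop} ?x . }}"
    )


def dual_intent(subject: str, prop_a: str, prop_b: str) -> str:
    """Dual intent (assumed reading): ask for two properties in one query."""
    return (
        f"{PREFIXES}SELECT ?a ?b WHERE {{ "
        f"wd:{subject} wdt:{prop_a} ?a . "
        f"wd:{subject} wdt:{prop_b} ?b . }}"
    )


if __name__ == "__main__":
    # Q142 = France, P36 = capital, Q90 = Paris
    print(factual("Q142", "P36"))
    print(boolean("Q142", "P36", "Q90"))
    # Q937 = Albert Einstein, P40 = child, P19 = place of birth, P569 = date of birth
    print(counting("Q937", "P40"))
    print(dual_intent("Q937", "P19", "P569"))
```

The point of the sketch is that the question types differ in query form (`SELECT`, `ASK`, `COUNT` aggregation, multiple projected variables), so a parser has to predict the query structure as well as the entities and relations.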
