CATS: A Pragmatic Chinese Answer-to-Sequence Dataset with Large Scale and High Quality
CATS addresses a gap for natural language processing researchers and developers by providing a practical dataset for Chinese TableQA systems, although the contribution is incremental because it builds on existing data-to-text frameworks.
The authors address the lack of large-scale, high-quality data-to-text datasets in non-English languages by introducing CATS, a Chinese answer-to-sequence dataset with 100,000 examples. They also propose a Unified Graph Transformation method that improved performance by 15% over baselines.
Popular data-to-text datasets have three problems. First, the large-scale datasets either contain noise or lack real application scenarios. Second, the datasets close to real applications are relatively small. Third, current datasets are biased toward English, leaving other languages underexplored. To alleviate these limitations, we present CATS, a pragmatic Chinese answer-to-sequence dataset with large scale and high quality. The dataset targets the generation of textual descriptions for the answers returned by a practical TableQA system. Further, to bridge the structural gap between the input SQL and table and to establish better semantic alignments, we propose a Unified Graph Transformation approach that establishes a joint encoding space for the two hybrid knowledge resources and converts the task into a graph-to-text problem. Experimental results demonstrate the effectiveness of the proposed method. Further analysis of CATS attests to both the high quality and the challenges of the dataset.
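To make the idea of a joint encoding space more concrete, the following is a minimal sketch of how an SQL query and its answer table might be merged into a single graph before being passed to a graph-to-text model. It is an illustration under stated assumptions, not the authors' Unified Graph Transformation: the `Graph` class, the `build_unified_graph` function, the node and edge labels, and the simplified SQL representation (a list of selected columns plus `(column, operator, value)` conditions) are all hypothetical, and the example data is invented.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Hypothetical container: labelled nodes plus typed, directed edges."""
    nodes: list = field(default_factory=list)  # node labels, indexed by node id
    edges: set = field(default_factory=set)    # (src_id, relation, dst_id)

    def add_node(self, label):
        self.nodes.append(label)
        return len(self.nodes) - 1

    def add_edge(self, src, relation, dst):
        self.edges.add((src, relation, dst))


def build_unified_graph(select_cols, conditions, header, rows):
    """Merge a simplified SQL query and an answer table into one graph.

    Assumption: SQL columns are aligned to table headers by exact name match.
    """
    g = Graph()

    # Table side: one node per header, row, and cell.
    header_ids = {name: g.add_node(("header", name)) for name in header}
    for r, row in enumerate(rows):
        row_id = g.add_node(("row", r))
        for name, value in zip(header, row):
            cell_id = g.add_node(("cell", value))
            g.add_edge(header_ids[name], "has_cell", cell_id)
            g.add_edge(row_id, "has_cell", cell_id)

    # SQL side: a SELECT root with one child per selected column.
    select_id = g.add_node(("sql", "SELECT"))
    for col in select_cols:
        col_id = g.add_node(("sql_col", col))
        g.add_edge(select_id, "selects", col_id)
        if col in header_ids:  # alignment edge between the two resources
            g.add_edge(col_id, "refers_to", header_ids[col])

    # Optional WHERE clause: one node per condition.
    if conditions:
        where_id = g.add_node(("sql", "WHERE"))
        g.add_edge(select_id, "filtered_by", where_id)
        for col, op, value in conditions:
            cond_id = g.add_node(("sql_cond", f"{col} {op} {value}"))
            g.add_edge(where_id, "has_condition", cond_id)
            if col in header_ids:
                g.add_edge(cond_id, "refers_to", header_ids[col])
    return g


# Invented example: a query over hotels and its two-row answer table.
graph = build_unified_graph(
    select_cols=["name", "price"],
    conditions=[("city", "=", "Beijing")],
    header=["name", "price"],
    rows=[["Hotel A", 300], ["Hotel B", 450]],
)
print(len(graph.nodes), len(graph.edges))
```

In this sketch the `refers_to` edges are what tie the SQL nodes to the table nodes, so a single graph encoder could attend over both resources at once. A real system would need a proper SQL parser and a more robust alignment strategy than exact name matching.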