GCDT: A Chinese RST Treebank for Multigenre and Multilingual Discourse Parsing
GCDT addresses the data scarcity problem facing computational linguists who work on Chinese discourse parsing, though the contribution is incremental in that it extends existing RST frameworks to a new language.
The authors tackle the lack of large-scale annotated data for hierarchical discourse parsing in Chinese by creating GCDT, the largest Mandarin Chinese RST treebank, with over 60K tokens across five genres. Using cross-lingual training, they also report state-of-the-art scores for Chinese RST parsing and for RST parsing on the English GUM dataset.
A lack of large-scale human-annotated data has hampered hierarchical discourse parsing of Chinese. In this paper, we present GCDT, the largest hierarchical discourse treebank for Mandarin Chinese in the framework of Rhetorical Structure Theory (RST). GCDT covers over 60K tokens across five genres of freely available text, using the same relation inventory as contemporary RST treebanks for English. We also report parsing experiments on this dataset, including state-of-the-art (SOTA) scores for Chinese RST parsing and for RST parsing on the English GUM dataset, obtained using cross-lingual training on Chinese and English with multilingual embeddings.