CL | Nov 1, 2024

GDTB: Genre Diverse Data for English Shallow Discourse Parsing across Modalities, Text Types, and Domains

arXiv:2411.00491v1 | 26 citations | h-index: 11 | EMNLP
Originality: Synthesis-oriented
AI Analysis

The dataset gives natural language processing researchers a more accessible and varied resource, though the contribution is incremental because it builds on existing frameworks.

The authors address the lack of an open, diverse dataset for English shallow discourse parsing by creating a new multi-genre benchmark based on the UD English GUM corpus. They show that joint training with existing data reduces out-of-domain degradation.

Work on shallow discourse parsing in English has focused on the Wall Street Journal corpus, the only large-scale dataset for the language in the PDTB framework. However, the data is not openly available, is restricted to the news domain, and is by now 35 years old. In this paper, we present and evaluate a new open-access, multi-genre benchmark for PDTB-style shallow discourse parsing, based on the existing UD English GUM corpus, for which discourse relation annotations in other frameworks already exist. In a series of experiments on cross-domain relation classification, we show that while our dataset is compatible with PDTB, substantial out-of-domain degradation is observed, which can be alleviated by joint training on both datasets.

Code Implementations: 1 repo
