CL · Nov 5, 2020

MEGA RST Discourse Treebanks with Structure and Nuclearity from Scalable Distant Sentiment Supervision

arXiv:2011.03017v1 · 1000 citations
AI Analysis

This work addresses the shortage of training data for discourse parsing in NLP, enabling more robust data-driven approaches. The contribution is incremental, however, since it builds on existing distant supervision techniques.

The authors tackle the lack of large discourse treebanks for RST-style parsing by developing a scalable method that uses distant sentiment supervision to automatically generate MEGA-DT, a new large-scale corpus. A parser trained on MEGA-DT shows promising inter-domain performance gains compared to parsers trained on human-annotated corpora.

The lack of large and diverse discourse treebanks hinders the application of data-driven approaches, such as deep learning, to RST-style discourse parsing. In this work, we present a novel scalable methodology to automatically generate discourse treebanks using distant supervision from sentiment-annotated datasets, creating and publishing MEGA-DT, a new large-scale discourse-annotated corpus. Our approach generates discourse trees incorporating structure and nuclearity for documents of arbitrary length by relying on an efficient heuristic beam-search strategy, extended with a stochastic component. Experiments on multiple datasets indicate that a discourse parser trained on our MEGA-DT treebank delivers promising inter-domain performance gains when compared to parsers trained on human-annotated discourse corpora.
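The abstract mentions a heuristic beam search with a stochastic component but does not spell it out. The Python sketch below is only an illustration of that general idea, not the authors' algorithm: it builds a binary tree bottom-up by merging adjacent spans, keeps the best-scoring partial forests, and adds a few randomly sampled ones. The names `Node`, `merge`, `score`, and `beam_search`, the weighted-average sentiment aggregation, the weight-based nuclearity rule, and the scoring heuristic are all assumptions made for this sketch.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Node:
    """A span of elementary discourse units (EDUs), inclusive indices."""
    start: int
    end: int
    sentiment: float            # hypothetical sentiment score of the span
    weight: float               # hypothetical importance, assumed > 0
    nuclearity: str = "leaf"    # "NS", "SN", or "NN" for internal nodes
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def merge(a: Node, b: Node) -> Node:
    """Merge two adjacent spans (assumed aggregation rule, not the paper's)."""
    total = a.weight + b.weight
    sentiment = (a.weight * a.sentiment + b.weight * b.sentiment) / total
    # Assumption: the more important child is treated as the nucleus.
    if a.weight > b.weight:
        nuclearity = "NS"
    elif a.weight < b.weight:
        nuclearity = "SN"
    else:
        nuclearity = "NN"
    return Node(a.start, b.end, sentiment, total / 2, nuclearity, a, b)


def score(forest: Tuple[Node, ...], gold: float) -> float:
    """Higher is better: closeness of the forest's sentiment to the gold label."""
    total = sum(n.weight for n in forest)
    estimate = sum(n.weight * n.sentiment for n in forest) / total
    return -abs(gold - estimate)


def beam_search(leaves: List[Node], gold: float, beam_size: int = 4,
                random_extra: int = 1,
                rng: Optional[random.Random] = None) -> Node:
    """Return one full tree; keeps top candidates plus a few random ones."""
    if not leaves:
        raise ValueError("need at least one EDU")
    rng = rng or random.Random(0)
    beam = [tuple(leaves)]
    while len(beam[0]) > 1:
        candidates = []
        for forest in beam:
            for i in range(len(forest) - 1):
                merged = merge(forest[i], forest[i + 1])
                candidates.append(forest[:i] + (merged,) + forest[i + 2:])
        # Drop duplicate forests while preserving order.
        candidates = list(dict.fromkeys(candidates))
        candidates.sort(key=lambda f: score(f, gold), reverse=True)
        kept = candidates[:beam_size]
        rest = candidates[beam_size:]
        # Stochastic component: also keep some lower-ranked candidates.
        kept += rng.sample(rest, min(random_extra, len(rest)))
        beam = kept
    return max(beam, key=lambda f: score(f, gold))[0]


def leaf_indices(node: Node) -> List[int]:
    """Left-to-right EDU indices covered by a tree."""
    if node.left is None or node.right is None:
        return [node.start]
    return leaf_indices(node.left) + leaf_indices(node.right)
```

Every merge step shortens each forest by one node, so the search finishes after one fewer step than there are EDUs. The random extras keep some diversity in the beam that a purely greedy ranking would discard.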
