CL · Sep 26, 2019

MinWikiSplit: A Sentence Splitting Corpus with Minimal Propositions

arXiv:1909.12131v1998 citations
Originality: Synthesis-oriented
AI Analysis

This work provides a domain-specific resource for NLP researchers working on text simplification. The contribution is incremental: it builds on existing corpora and differs from them mainly in splitting each sentence into minimal propositions.

The authors compiled a new sentence splitting corpus of 203K aligned complex-simple sentence pairs, where each input is broken into minimal propositions, to facilitate the development of approaches that transform complex sentences into fine-grained, simpler structures for improved downstream processing.

We compiled a new sentence splitting corpus that is composed of 203K pairs of aligned complex source and simplified target sentences. Contrary to previously proposed text simplification corpora, which contain only a small number of split examples, we present a dataset where each input sentence is broken down into a set of minimal propositions, i.e. a sequence of sound, self-contained utterances with each of them presenting a minimal semantic unit that cannot be further decomposed into meaningful propositions. This corpus is useful for developing sentence splitting approaches that learn how to transform sentences with a complex linguistic structure into a fine-grained representation of short sentences that present a simple and more regular structure which is easier to process for downstream applications and thus facilitates and improves their performance.
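As a rough illustration of what an aligned pair in such a corpus looks like, the sketch below models one complex source sentence together with its minimal propositions. The on-disk layout assumed here (a tab between source and target, and a `<SEP>` marker between target sentences) is hypothetical and not taken from the paper, as are the names `SplitPair` and `parse_pair` and the example sentence itself.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SplitPair:
    """One aligned example: a complex source sentence and its simplified targets."""
    source: str
    targets: Tuple[str, ...]


def parse_pair(line: str, field_sep: str = "\t", sentence_sep: str = " <SEP> ") -> SplitPair:
    """Parse one line of a hypothetical 'source<TAB>target <SEP> target' file."""
    # Assumption: the first tab separates the source from the target field.
    source, target_field = line.rstrip("\n").split(field_sep, maxsplit=1)
    # Assumption: target sentences are joined with a ' <SEP> ' marker.
    targets = tuple(t.strip() for t in target_field.split(sentence_sep) if t.strip())
    return SplitPair(source=source.strip(), targets=targets)


# Invented example sentence, for illustration only (not from the corpus).
example_line = (
    "The house, which was built in 1900, was sold last year.\t"
    "The house was built in 1900. <SEP> The house was sold last year."
)
pair = parse_pair(example_line)
```

Each entry in `pair.targets` is meant to stand alone as a short sentence, which is the property the abstract describes for minimal propositions. The separators are parameters, so the same function can be pointed at whatever delimiter the released files actually use.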

