MinWikiSplit: A Sentence Splitting Corpus with Minimal Propositions
The corpus is a domain-specific resource for NLP researchers working on text simplification. Its contribution is incremental: it builds on existing corpora and differs mainly in its focus on minimal propositions.
The authors compiled a new sentence splitting corpus of 203K aligned complex-simple sentence pairs, where each input is broken into minimal propositions, to facilitate the development of approaches that transform complex sentences into fine-grained, simpler structures for improved downstream processing.
We compiled a new sentence splitting corpus composed of 203K pairs of aligned complex source and simplified target sentences. In contrast to previously proposed text simplification corpora, which contain only a small number of split examples, we present a dataset in which each input sentence is broken down into a set of minimal propositions, i.e., a sequence of sound, self-contained utterances, each presenting a minimal semantic unit that cannot be further decomposed into meaningful propositions. The corpus is useful for developing sentence splitting approaches that learn to transform sentences with a complex linguistic structure into a fine-grained representation of short sentences with a simple and more regular structure. Such a representation is easier for downstream applications to process, which in turn improves their performance.
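To make the shape of one aligned pair concrete, the following is a minimal sketch in Python of how a complex source sentence and its minimal propositions might be represented and parsed. Everything here is an assumption for illustration only: the `SplitPair` class, the `parse_pair` helper, the tab-separated layout, and the `<SEP>` marker between propositions are hypothetical and do not describe the released file format. The example sentence is invented and not drawn from the corpus.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SplitPair:
    """One aligned example (hypothetical container, not the corpus schema)."""
    complex_sentence: str          # the complex source sentence
    propositions: Tuple[str, ...]  # the simplified target, one minimal proposition each


def parse_pair(line: str, field_sep: str = "\t", prop_sep: str = "<SEP>") -> SplitPair:
    """Parse one line of an ASSUMED layout: source, field_sep, then propositions joined by prop_sep."""
    source, target = line.rstrip("\n").split(field_sep, 1)
    # Drop empty fragments and surrounding whitespace around each proposition.
    propositions = tuple(p.strip() for p in target.split(prop_sep) if p.strip())
    return SplitPair(source.strip(), propositions)


# Invented example for illustration; not taken from the corpus.
example_line = (
    "The bridge, which was built in 1932, connects the two districts."
    "\t"
    "The bridge was built in 1932. <SEP> The bridge connects the two districts."
)
pair = parse_pair(example_line)
for proposition in pair.propositions:
    print(proposition)
```

Keeping the propositions as an ordered tuple rather than a single string makes it straightforward to count splits per source sentence or to feed each proposition separately to a downstream component.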