A New Aligned Simple German Corpus
This work addresses the problem of language accessibility for groups who need simplified German, but the contribution is incremental, since it builds on existing alignment methods and datasets.
The authors address the lack of accessible German text by creating a new sentence-aligned monolingual corpus for Simple German; its sentence alignments reach an F1-score that surpasses previous work.
"Leichte Sprache", the German counterpart to Simple English, is a regulated language that aims to make complex written language accessible to groups of people who would otherwise be excluded from it. We present a new sentence-aligned monolingual corpus for Simple German -- German. It contains multiple document-aligned sources, which we have aligned at the sentence level using automatic sentence-alignment methods. We evaluate our alignments on a manually labelled subset of aligned documents. The quality of our sentence alignments, as measured by F1-score, surpasses previous work. We publish the dataset under CC BY-SA and the accompanying code under the MIT license.
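To make the evaluation criterion concrete, the following minimal sketch shows one common way to score automatic sentence alignments against a manually labelled subset. It is an illustration under stated assumptions, not the evaluation code released with the corpus: it assumes each alignment is a pair of sentence indices (Simple German, German), that a predicted pair counts as correct only if it matches a gold pair exactly, and the name `alignment_f1` and the toy data are hypothetical.

```python
from typing import Iterable, Tuple

# Assumption: an alignment is a pair of sentence indices,
# (index in the Simple German document, index in the German document).
Pair = Tuple[int, int]


def alignment_f1(predicted: Iterable[Pair], gold: Iterable[Pair]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for predicted alignments against gold labels.

    Hypothetical helper: a predicted pair is a true positive only if the
    identical pair appears in the manually labelled gold set.
    """
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)

    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Toy data, invented purely for illustration.
gold = [(0, 0), (1, 2), (2, 3), (3, 5)]
predicted = [(0, 0), (1, 2), (2, 4)]

precision, recall, f1 = alignment_f1(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case two of the three predicted pairs are correct and two of the four gold pairs are recovered, so F1 is the harmonic mean of 2/3 and 1/2. Scoring on exact pair matches is the strictest option; an evaluation that credits partial matches in n:m alignments would need a different definition of a true positive.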