A German Corpus for Text Similarity Detection Tasks
This work provides a domain-specific resource for researchers and practitioners in natural language processing who work on German text similarity. Its contribution is incremental, since it applies existing methods to new data.
The authors address the lack of a German corpus for text similarity detection by presenting a new German text corpus. It is designed for automatically assessing the similarity between texts and for evaluating similarity measures, at the level of both whole documents and individual sentences. Their results include several simple measures computed on the corpus using a library of similarity functions.
Text similarity detection aims to measure the degree of similarity between a pair of texts. Corpora available for text similarity detection are designed to evaluate algorithms that assess the degree of paraphrase between documents. In this paper we present a German text corpus for similarity detection. The purpose of this corpus is to automatically assess the similarity between a pair of texts and to evaluate different similarity measures, both for whole documents and for individual sentences. To this end, we have calculated several simple measures on our corpus based on a library of similarity functions.
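To make the notion of a "simple measure" concrete, the following is a minimal sketch of two common token-based similarity functions, Jaccard overlap and bag-of-words cosine similarity. The abstract does not name the measures or the library actually used, so the choice of measures, the function names (`tokenize`, `jaccard`, `cosine`), the regex tokenizer, and the example sentences are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase, then split into word characters.
    # In Python 3, \w matches Unicode letters, so German umlauts and ß are kept.
    return re.findall(r"\w+", text.lower())


def jaccard(a, b):
    # Jaccard similarity: shared token types divided by all token types.
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty texts count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def cosine(a, b):
    # Cosine similarity between term-frequency (bag-of-words) vectors.
    counts_a, counts_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: an empty text matches nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Illustrative sentence pair, not taken from the corpus.
    s1 = "Der Hund schläft"
    s2 = "Der Hund bellt"
    print(jaccard(s1, s2))
    print(cosine(s1, s2))
```

The same functions apply unchanged to whole documents or to individual sentences, since both are passed in as plain strings; only the granularity of the input differs.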