Annotation Guidelines for the Turku Paraphrase Corpus
The corpus provides a resource for natural language processing tasks in Finnish, but the contribution is incremental, as it extends existing paraphrase annotation methods to a new language.
The paper tackles the problem of creating a paraphrase corpus for Finnish by developing annotation guidelines with a 1-4 scale and subcategories, resulting in the annotation of over 100,000 paraphrase pairs.
This document describes the annotation guidelines used to construct the Turku Paraphrase Corpus. The guidelines were developed alongside the corpus annotation and were regularly revised and extended during the annotation work. Our paraphrase annotation scheme uses a base scale of 1-4, where labels 1 and 2 mark negative candidates (not paraphrases), while labels 3 and 4 mark paraphrases that hold at least in the given context, if not everywhere. In addition to the base labels, the scheme includes subcategories (flags) that distinguish different types of paraphrases within the two positive labels, making it suitable for more fine-grained paraphrase categorization. The annotation scheme has been used to annotate over 100,000 Finnish paraphrase pairs.
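To make the structure of the scheme concrete, the following is a minimal sketch of how a single annotation could be represented in code. It assumes only what is stated above: a base label from 1 to 4, with labels 3 and 4 positive, and optional flags that apply only to positive labels. The class name `ParaphraseLabel`, its fields, and the placeholder flag string are hypothetical and are not taken from the corpus release; the actual flag inventory is defined in the guidelines themselves.

```python
from dataclasses import dataclass

# Base labels as described in the abstract: 1-2 negative, 3-4 positive.
NEGATIVE_LABELS = frozenset({1, 2})
POSITIVE_LABELS = frozenset({3, 4})


@dataclass(frozen=True)
class ParaphraseLabel:
    """Hypothetical container for one annotation: base label plus flags."""

    base: int
    # Flags are kept as opaque strings here; their real names and meanings
    # are defined by the guidelines and are not assumed in this sketch.
    flags: frozenset = frozenset()

    def __post_init__(self):
        if self.base not in NEGATIVE_LABELS | POSITIVE_LABELS:
            raise ValueError(f"base label must be 1-4, got {self.base}")
        # The scheme attaches subcategories only to the positive labels.
        if self.flags and self.base not in POSITIVE_LABELS:
            raise ValueError("flags are only allowed on labels 3 and 4")

    @property
    def is_paraphrase(self) -> bool:
        """True for labels 3 and 4, False for labels 1 and 2."""
        return self.base in POSITIVE_LABELS


# Usage: "some-flag" is a placeholder, not a real flag from the corpus.
positive = ParaphraseLabel(base=4, flags=frozenset({"some-flag"}))
negative = ParaphraseLabel(base=2)
```

Keeping the base label and the flags in separate fields lets a consumer collapse the scheme to a binary paraphrase decision through `is_paraphrase` while still retaining the finer-grained subcategories when they are needed.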