CL · Mar 24, 2021

Finnish Paraphrase Corpus

Jenna Kanerva, Filip Ginter, Li-Hsin Chang, Iiro Rastas, Valtteri Skantsi, Jemina Kilpeläinen, Hanna-Mari Kupari, Jenna Saarni, Maija Sevón, Otto Tarkka

arXiv:2103.13103v131.8728 citations · Has Code

Originality: Synthesis-oriented

AI Analysis

This paper provides a high-quality resource for Finnish NLP tasks, though the contribution is incremental: it applies existing methods to a new language.

The authors addressed the lack of a manually annotated paraphrase corpus for Finnish by creating one with 53,572 paraphrase pairs drawn from alternative subtitles and news headings. Of these pairs, 98% were manually classified as paraphrases at least in their given context.

In this paper, we introduce the first fully manually annotated paraphrase corpus for Finnish, containing 53,572 paraphrase pairs harvested from alternative subtitles and news headings. Of all paraphrase pairs in our corpus, 98% are manually classified as paraphrases at least in their given context, if not in all contexts. Additionally, we establish a manual candidate selection method and demonstrate its feasibility for high-quality paraphrase selection in terms of both cost and quality.
