CL
May 10, 2022

ParaCotta: Synthetic Multilingual Paraphrase Corpora from the Most Diverse Translation Sample Pair

arXiv:2205.04651v1651 citations
h-index: 36
AI Analysis

ParaCotta provides a scalable resource for multilingual NLP tasks, though the contribution is incremental: it builds on existing translation-based paraphrase generation methods.

The authors address the lack of multilingual paraphrase corpora by building ParaCotta, a synthetic parallel paraphrase corpus covering 17 languages, generated using only monolingual data and a neural machine translation system. In their comparison with ParaBank2, the resulting paraphrase pairs are semantically similar and lexically diverse.

We release our synthetic parallel paraphrase corpus across 17 languages: Arabic, Catalan, Czech, German, English, Spanish, Estonian, French, Hindi, Indonesian, Italian, Dutch, Romanian, Russian, Swedish, Vietnamese, and Chinese. Our method relies only on monolingual data and a neural machine translation system to generate paraphrases, and is hence simple to apply. We generate multiple translation samples using beam search and choose the most lexically diverse pair according to their sentence BLEU. We compare our generated corpus with ParaBank2. According to our evaluation, our synthetic paraphrase pairs are semantically similar and lexically diverse.
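
The selection step described in the abstract (pick the most lexically diverse pair among several translation samples, judged by sentence BLEU) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `sentence_bleu` and `most_diverse_pair`, the whitespace tokenization, the add-one smoothing, and the choice to average BLEU in both directions are all assumptions made here for the example. The candidate list stands in for beam-search outputs from a translation system, which is not shown.

```python
import math
from collections import Counter
from itertools import combinations


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Simplified sentence BLEU in [0, 1] on whitespace tokens (assumed variant)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_counts & ref_counts).values())  # clipped matches
        total = sum(hyp_counts.values())
        # Add-one smoothing keeps the score defined when an order has no match.
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    brevity_penalty = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return brevity_penalty * math.exp(log_precision)


def most_diverse_pair(candidates):
    """Return the two candidates with the lowest symmetric sentence BLEU."""
    unique = list(dict.fromkeys(candidates))  # drop exact duplicates, keep order
    if len(unique) < 2:
        raise ValueError("need at least two distinct candidates")

    def symmetric_bleu(pair):
        a, b = pair
        return (sentence_bleu(a, b) + sentence_bleu(b, a)) / 2

    return min(combinations(unique, 2), key=symmetric_bleu)


if __name__ == "__main__":
    # Hypothetical translation samples for one source sentence.
    samples = [
        "the cat sat on the mat",
        "the cat sat on a mat",
        "a feline was sitting on the rug",
    ]
    print(most_diverse_pair(samples))
```

Lower BLEU between two samples means less n-gram overlap, so minimizing it favors pairs that share meaning (both are translations of the same source) but differ in wording. A production pipeline would more likely use an established BLEU implementation and a language-appropriate tokenizer, particularly for languages such as Chinese where whitespace splitting is not meaningful.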
