ARPA: Armenian Paraphrase Detection Corpus and Models
This work addresses the problem of limited NLP resources for low-resource languages like Armenian, though the contribution is incremental, as it applies existing methods to a new language.
The authors tackle the lack of a paraphrase detection corpus for Armenian by building one with a semi-automatic back-translation method, yielding a dataset of 2360 paraphrases, and by training BERT-based models that achieve results comparable to the state of the art in other languages.
In this work, we employ a semi-automatic method based on back translation to generate a sentential paraphrase corpus for the Armenian language. The initial collection of sentences is translated from Armenian to English and back twice, resulting in pairs of lexically distant but semantically similar sentences. The generated paraphrases are then manually reviewed and annotated. Using this method, train and test datasets are created, containing 2360 paraphrases in total. In addition, the datasets are used to train and evaluate BERT-based models for detecting paraphrases in Armenian, achieving results comparable to the state of the art for other languages.
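
The back-translation step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `translate` callable is a hypothetical placeholder for whatever machine translation system is available, "twice" is read here as two Armenian-English-Armenian round trips, and the token-overlap score is an assumed, simple way to surface lexically distant candidates before the manual review the abstract mentions.

```python
from typing import Callable, List, Tuple

# Hypothetical translator signature: (text, source_lang, target_lang) -> text.
# Any MT system could be wrapped to fit it; none is assumed here.
Translator = Callable[[str, str, str], str]


def back_translate(sentence: str, translate: Translator, rounds: int = 2,
                   src: str = "hy", pivot: str = "en") -> str:
    """Send a sentence through the pivot language and back, `rounds` times."""
    current = sentence
    for _ in range(rounds):
        pivoted = translate(current, src, pivot)   # Armenian -> English
        current = translate(pivoted, pivot, src)   # English -> Armenian
    return current


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of whitespace tokens; lower means more lexically distant."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def generate_candidates(sentences: List[str], translate: Translator,
                        rounds: int = 2) -> List[Tuple[str, str, float]]:
    """Return (original, paraphrase candidate, overlap) triples for manual review."""
    candidates = []
    for sentence in sentences:
        paraphrase = back_translate(sentence, translate, rounds)
        if paraphrase != sentence:  # identical output is not a useful paraphrase
            candidates.append((sentence, paraphrase, token_overlap(sentence, paraphrase)))
    return candidates
```

The output of `generate_candidates` is only a candidate list; in the described method, each pair is still reviewed and annotated by hand before it enters the corpus.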