CLMay 26, 2023

ParaAMR: A Large-Scale Syntactically Diverse Paraphrase Dataset by AMR Back-Translation

Kuan-Hao Huang, Varun Iyer, I-Hung Hsu, Anoop Kumar, Kai-Wei Chang, Aram Galstyan

arXiv:2305.16585v126.6228 citationsHas Code

Originality Incremental advance

AI Analysis

This addresses the need for scalable and diverse paraphrase data for NLP tasks, though it is incremental as it builds on existing back-translation methods with a new representation.

The authors tackled the problem of generating syntactically diverse paraphrases by creating ParaAMR, a large-scale dataset using abstract meaning representation back-translation, which showed improved syntactic diversity while maintaining semantic similarity in evaluations.

Paraphrase generation is a long-standing task in natural language processing (NLP). Supervised paraphrase generation models, which rely on human-annotated paraphrase pairs, are cost-inefficient and hard to scale up. On the other hand, automatically annotated paraphrase pairs (e.g., by machine back-translation), usually suffer from the lack of syntactic diversity -- the generated paraphrase sentences are very similar to the source sentences in terms of syntax. In this work, we present ParaAMR, a large-scale syntactically diverse paraphrase dataset created by abstract meaning representation back-translation. Our quantitative analysis, qualitative examples, and human evaluation demonstrate that the paraphrases of ParaAMR are syntactically more diverse compared to existing large-scale paraphrase datasets while preserving good semantic similarity. In addition, we show that ParaAMR can be used to improve on three NLP tasks: learning sentence embeddings, syntactically controlled paraphrase generation, and data augmentation for few-shot learning. Our results thus showcase the potential of ParaAMR for improving various NLP applications.

View on arXiv PDF Code

Similar