CL | Feb 11, 2025

A Large-Scale Benchmark for Vietnamese Sentence Paraphrases

arXiv:2502.07188v116.311 citations | h-index: 1 | Has Code | NAACL

Originality: Synthesis-oriented

AI Analysis

The dataset gives researchers and developers working on Vietnamese NLP a crucial resource, though the contribution is incremental: it applies existing methods to a new language-specific dataset.

The paper tackles the lack of large-scale resources for Vietnamese sentence paraphrasing by introducing ViSP, a dataset of 1.2M high-quality paraphrase pairs, and evaluates it with various methods including LLMs, establishing a foundational benchmark.

This paper presents ViSP, a high-quality Vietnamese dataset for sentence paraphrasing, consisting of 1.2M original-paraphrase pairs collected from various domains. The dataset was constructed using a hybrid approach that combines automatic paraphrase generation with manual evaluation to ensure high quality. We conducted experiments using methods such as back-translation, EDA, and baseline models like BART and T5, as well as large language models (LLMs), including GPT-4o, Gemini-1.5, Aya, Qwen-2.5, and Meta-Llama-3.1 variants. To the best of our knowledge, this is the first large-scale study on Vietnamese paraphrasing. We hope that our dataset and findings will serve as a valuable foundation for future research and applications in Vietnamese paraphrase tasks.
