CL | Feb 11, 2025

A Large-Scale Benchmark for Vietnamese Sentence Paraphrases

arXiv:2502.07188v1 | 11 citations | h-index: 1 | NAACL
Originality: Synthesis-oriented
AI Analysis

The dataset is a crucial resource for researchers and developers working on Vietnamese NLP tasks, though the contribution is incremental, since it applies existing methods to a new language-specific dataset.

The paper tackles the lack of large-scale resources for Vietnamese sentence paraphrasing by introducing ViSP, a dataset of 1.2M high-quality paraphrase pairs, and evaluates it with various methods including LLMs, establishing a foundational benchmark.

This paper presents ViSP, a high-quality Vietnamese dataset for sentence paraphrasing, consisting of 1.2M original-paraphrase pairs collected from various domains. The dataset was constructed using a hybrid approach that combines automatic paraphrase generation with manual evaluation to ensure high quality. We conducted experiments using methods such as back-translation, EDA, and baseline models like BART and T5, as well as large language models (LLMs), including GPT-4o, Gemini-1.5, Aya, Qwen-2.5, and Meta-Llama-3.1 variants. To the best of our knowledge, this is the first large-scale study on Vietnamese paraphrasing. We hope that our dataset and findings will serve as a valuable foundation for future research and applications in Vietnamese paraphrase tasks.
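For readers unfamiliar with EDA (Easy Data Augmentation), one of the baselines named in the abstract, the following is a minimal sketch of two of its four standard operations, random swap and random deletion. It is not the paper's implementation. The function names, the `rng` parameter, and the whitespace tokenization are assumptions made for illustration; Vietnamese words often span several syllables, so a real pipeline would likely use a proper word segmenter.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Return a copy of tokens with two random positions swapped, n_swaps times."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens  # nothing to swap
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)  # two distinct positions
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, p, rng):
    """Drop each token independently with probability p; never return an empty list."""
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if rng.random() >= p]
    # If every token was dropped, fall back to one randomly chosen token.
    return kept if kept else [rng.choice(tokens)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    # Hypothetical example sentence, split on whitespace (an assumption).
    sentence = "tôi thích học ngôn ngữ".split()
    print(" ".join(random_swap(sentence, 1, rng)))
    print(" ".join(random_deletion(sentence, 0.2, rng)))
```

The other two EDA operations, synonym replacement and random insertion, need a synonym resource for the target language and are omitted here.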

Code Implementations: 1 repo
