What Makes a Good Paraphrase: Do Automated Evaluations Work?
This work addresses the problem of reliable paraphrase evaluation for NLP researchers, but it appears incremental because it focuses on a specific dataset and existing evaluation methods.
The paper investigates what constitutes a good paraphrase and whether automated metrics can effectively evaluate paraphrase quality, using experiments on a German dataset with both automatic and expert linguistic evaluations.
Paraphrasing is the task of expressing an essential idea or meaning in different words. But how different should the words be for the result to be considered an acceptable paraphrase? And can we rely exclusively on automated metrics to evaluate the quality of a paraphrase? We attempt to answer these questions by conducting experiments on a German dataset and performing automatic and expert linguistic evaluation.