PARAPHRASUS: A Comprehensive Benchmark for Evaluating Paraphrase Detection Models
This work addresses the need for better evaluation of paraphrase detection models in NLP, though its contribution is incremental, as it builds on existing datasets and methods.
The authors address the overly simplistic treatment of paraphrase detection in NLP by creating PARAPHRASUS, a comprehensive benchmark for multi-dimensional evaluation. It reveals trade-offs between models and comprises 3 challenges across over 10 datasets.
The task of determining whether two texts are paraphrases has long been a challenge in NLP. However, the prevailing notion of paraphrase is often quite simplistic, offering only a limited view of the vast spectrum of paraphrase phenomena. Indeed, we find that evaluating models on a paraphrase dataset can leave uncertainty about their true semantic understanding. To alleviate this, we create PARAPHRASUS, a benchmark designed for multi-dimensional assessment, benchmarking, and selection of paraphrase detection models. We find that, under our fine-grained evaluation lens, paraphrase detection models exhibit trade-offs that cannot be captured through a single classification dataset. Furthermore, PARAPHRASUS allows prompt calibration for different use cases, tailoring LLMs to specific strictness levels. PARAPHRASUS includes 3 challenges spanning over 10 datasets, including 8 repurposed and 2 newly annotated; we release it along with a benchmarking library at https://github.com/impresso/paraphrasus
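
To illustrate why a single dataset can hide trade-offs, the following minimal Python sketch scores two toy detectors separately per challenge. Everything in it is an assumption made for illustration: the challenge names, the `error_rates` and `overlap_detector` helpers, the thresholds, and the six sentence pairs are hypothetical and are not taken from PARAPHRASUS or its benchmarking library.

```python
from collections import defaultdict


def error_rates(examples, predict):
    """Return the error rate of `predict` for each challenge.

    `examples` is an iterable of (challenge, text_a, text_b, is_paraphrase).
    """
    errors = defaultdict(int)
    totals = defaultdict(int)
    for challenge, text_a, text_b, label in examples:
        totals[challenge] += 1
        if predict(text_a, text_b) != label:
            errors[challenge] += 1
    return {c: errors[c] / totals[c] for c in totals}


def overlap_detector(threshold):
    """Hypothetical toy detector: Jaccard token overlap >= threshold."""
    def predict(text_a, text_b):
        tokens_a = set(text_a.lower().split())
        tokens_b = set(text_b.lower().split())
        return len(tokens_a & tokens_b) / len(tokens_a | tokens_b) >= threshold
    return predict


# Made-up sentence pairs; the challenge names are illustrative only.
TOY_EXAMPLES = [
    ("mixed", "the cat sat on the mat", "the cat sat on the mat today", True),
    ("mixed", "he bought a car", "she sold a house", False),
    ("non_paraphrases_only", "the film was good", "the film was not good", False),
    ("non_paraphrases_only", "prices rose sharply", "prices fell sharply", False),
    ("paraphrases_only", "he passed away", "he died", True),
    ("paraphrases_only", "the kid is happy", "the child feels joyful", True),
]

lenient = error_rates(TOY_EXAMPLES, overlap_detector(0.2))
strict = error_rates(TOY_EXAMPLES, overlap_detector(0.6))
print("lenient:", lenient)
print("strict: ", strict)
```

On this toy data the lenient (0.2) and strict (0.6) detectors are indistinguishable on the mixed set, with zero errors each, yet they err in opposite directions on the other two: the lenient one accepts every non-paraphrase, while the strict one rejects every paraphrase. A single aggregate score on the mixed set would not reveal that difference.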