CL · Oct 8, 2020

PARADE: A New Dataset for Paraphrase Identification Requiring Computer Science Domain Knowledge

arXiv:2010.03725v1998 citations
Originality: Synthesis-oriented
AI Analysis

This paper addresses the problem of evaluating models that incorporate domain knowledge, which is relevant to researchers in natural language processing. The contribution is incremental because it focuses on a single domain, computer science.

The authors introduce PARADE, a dataset for paraphrase identification that requires computer science domain knowledge, and find that both state-of-the-art models such as BERT (F1 score of 0.709 after fine-tuning) and non-expert human annotators perform poorly on it.

We present a new benchmark dataset called PARADE for paraphrase identification that requires specialized domain knowledge. PARADE contains paraphrases that overlap very little at the lexical and syntactic level but are semantically equivalent based on computer science domain knowledge, as well as non-paraphrases that overlap greatly at the lexical and syntactic level but are not semantically equivalent based on this domain knowledge. Experiments show that both state-of-the-art neural models and non-expert human annotators have poor performance on PARADE. For example, BERT after fine-tuning achieves an F1 score of 0.709, which is much lower than its performance on other paraphrase identification datasets. PARADE can serve as a resource for researchers interested in testing models that incorporate domain knowledge. We make our data and code freely available.
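To make the abstract's two ideas concrete (lexical overlap between sentence pairs, and F1 as the reported metric), here is a minimal sketch in Python. It is not the authors' code: the function names `jaccard_overlap` and `binary_f1`, the whitespace tokenization, the 0.5 threshold, and the two example pairs are all assumptions made for illustration, and the pairs are invented rather than taken from PARADE.

```python
from typing import Sequence


def jaccard_overlap(a: str, b: str) -> float:
    """Token-level Jaccard similarity (lowercased, whitespace-split tokens)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def binary_f1(gold: Sequence[int], pred: Sequence[int]) -> float:
    """F1 for the positive class (1 = paraphrase, 0 = non-paraphrase)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical pairs (NOT from PARADE), built to mimic the pattern the abstract describes.
pairs = [
    # Same meaning given domain knowledge, but no shared tokens.
    ("a last-in first-out collection", "stack data structure", 1),
    # Nearly identical wording, but a queue is not a stack.
    ("a first-in first-out collection of items",
     "a last-in first-out collection of items", 0),
]

THRESHOLD = 0.5  # assumed cut-off for a naive overlap-based classifier
gold = [label for _, _, label in pairs]
pred = [int(jaccard_overlap(a, b) >= THRESHOLD) for a, b, _ in pairs]

print("overlaps:", [round(jaccard_overlap(a, b), 3) for a, b, _ in pairs])
print("F1 of overlap baseline:", binary_f1(gold, pred))
```

On these invented pairs the overlap heuristic labels both examples incorrectly, which is the kind of failure a dataset built from low-overlap paraphrases and high-overlap non-paraphrases is meant to expose.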

Code Implementations: 1 repo
