CL · Nov 3, 2020

Finding Friends and Flipping Frenemies: Automatic Paraphrase Dataset Augmentation Using Graph Theory

arXiv:2011.01856v11000 citations
Originality: Incremental advance
AI Analysis

This work addresses data quality problems in NLP datasets, a concern for both researchers and practitioners. It is an incremental advance that builds on existing graph-based and paraphrase modeling techniques.

The authors address inconsistent labeling and limited size in manually labeled NLP datasets by proposing automatic dataset augmentation methods based on graph theory. Paraphrase models trained on the enhanced datasets are more accurate.

Most NLP datasets are manually labeled, so they suffer from inconsistent labeling or limited size. We propose methods for automatically improving datasets by viewing them as graphs with expected semantic properties. We construct a paraphrase graph from the provided sentence pair labels, and create an augmented dataset by directly inferring labels from the original sentence pairs using a transitivity property. We use structural balance theory to identify likely mislabelings in the graph, and flip their labels. We evaluate our methods on paraphrase models trained using these datasets starting from a pretrained BERT model, and find that the automatically enhanced training sets result in more accurate models.
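
The graph view described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give the exact procedure, so the function name `augment_pairs`, the union-find clustering, and the definition of a "conflict" (a non-paraphrase label between two sentences that are already connected through paraphrase labels) are all assumptions made for illustration. The sketch infers new labels by transitivity and only flags candidate mislabelings; it does not decide which label to flip.

```python
from itertools import combinations

def find(parent, x):
    # Union-find lookup with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def augment_pairs(pairs):
    """Hypothetical helper: pairs is a list of (sent_a, sent_b, label),
    where label 1 means paraphrase and 0 means non-paraphrase."""
    parent = {}
    for a, b, _ in pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
    # Merge sentences linked by paraphrase labels into clusters.
    for a, b, label in pairs:
        if label == 1:
            parent[find(parent, a)] = find(parent, b)
    clusters = {}
    for s in parent:
        clusters.setdefault(find(parent, s), []).append(s)

    inferred = {}
    conflicts = set()
    for a, b, label in pairs:
        if label != 0:
            continue
        ra, rb = find(parent, a), find(parent, b)
        if ra == rb:
            # Assumed conflict rule: a negative label inside a paraphrase
            # cluster contradicts transitivity, so flag it for review.
            conflicts.add(tuple(sorted((a, b))))
        else:
            # A negative label between two clusters extends to every
            # cross-cluster pair.
            for x in clusters[ra]:
                for y in clusters[rb]:
                    inferred[tuple(sorted((x, y)))] = 0
    # Every non-conflicting pair inside a cluster is a paraphrase.
    for members in clusters.values():
        for pair in combinations(sorted(members), 2):
            if pair not in conflicts:
                inferred[pair] = 1
    return inferred, conflicts

if __name__ == "__main__":
    toy = [("a", "b", 1), ("b", "c", 1), ("c", "d", 0), ("a", "c", 0)]
    labels, flagged = augment_pairs(toy)
    print(labels)
    print(flagged)
```

In the toy input, the pair `("a", "c")` is labeled non-paraphrase even though both sentences are linked to `"b"` by paraphrase labels, so it is flagged rather than given an inferred label. Whether to flip that label or one of the positive labels is the decision the paper assigns to structural balance theory; the sketch leaves it open.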

Code Implementations: 1 repo
