CL · Jan 2, 2021

Substructure Substitution: Structured Data Augmentation for NLP

arXiv:2101.00411v131.7719 citations

Originality: Incremental advance

AI Analysis

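This work offers a data augmentation method for NLP that performs more consistently than the other augmentation methods it was compared against. It is most relevant to researchers and practitioners working on structured tasks (e.g., part-of-speech tagging and parsing) and on general tasks such as text classification.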

This paper introduces Substructure Substitution (SUB2), a data augmentation technique for NLP that generates new examples by replacing substructures with others of the same label. The method improves performance over original datasets and shows more consistent results across tasks and dataset sizes compared to other augmentation methods.

We study a family of data augmentation methods, substructure substitution (SUB2), for natural language processing (NLP) tasks. SUB2 generates new examples by substituting substructures (e.g., subtrees or subsequences) with ones that have the same label, and it can be applied to many structured NLP tasks such as part-of-speech tagging and parsing. For more general tasks (e.g., text classification) that do not have explicitly annotated substructures, we present variations of SUB2 based on constituency parse trees, introducing structure-aware data augmentation methods to general NLP tasks. In most cases, training on the SUB2-augmented dataset achieves better performance than training on the original training set. Further experiments show that SUB2 performs more consistently than the other augmentation methods investigated, across different tasks and sizes of the seed dataset.
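To make the substitution idea concrete, here is a minimal sketch for the part-of-speech tagging case, where a "substructure" is taken to be a short token span and its "label" is the span's tag sequence. This is an illustration under stated assumptions, not the authors' implementation: the function names `build_span_index` and `substitute_span`, the `max_len` parameter, and the uniform sampling of spans and replacements are all hypothetical choices made for this example.

```python
import random
from collections import defaultdict


def build_span_index(dataset, max_len=3):
    """Map each tag sequence (as a tuple) to the token spans that carry it.

    `dataset` is a list of (tokens, tags) pairs. `max_len` is a hypothetical
    cap on span length, chosen only to keep the index small.
    """
    index = defaultdict(list)
    for tokens, tags in dataset:
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                index[tuple(tags[i:i + n])].append(tuple(tokens[i:i + n]))
    return index


def substitute_span(tokens, tags, index, rng, max_len=3):
    """Return a new (tokens, tags) pair with one span swapped for another
    span that has the same tag sequence. The tags are left unchanged."""
    if not tokens:
        return list(tokens), list(tags)
    # Assumption: span length and position are sampled uniformly.
    n = rng.randint(1, min(max_len, len(tokens)))
    i = rng.randint(0, len(tokens) - n)
    key = tuple(tags[i:i + n])
    # Fall back to the original span if the index has no candidate.
    candidates = index.get(key) or [tuple(tokens[i:i + n])]
    replacement = rng.choice(candidates)
    return tokens[:i] + list(replacement) + tokens[i + n:], list(tags)


if __name__ == "__main__":
    seed_data = [
        (["the", "cat", "sleeps"], ["DET", "NOUN", "VERB"]),
        (["a", "dog", "barks"], ["DET", "NOUN", "VERB"]),
    ]
    index = build_span_index(seed_data)
    rng = random.Random(0)
    tokens, tags = seed_data[0]
    print(substitute_span(tokens, tags, index, rng))
```

Because the replacement span has the same tag sequence as the span it replaces, the augmented sentence keeps its original labels and length. The parse-tree variants the abstract mentions for text classification would swap subtrees rather than flat spans and are not shown here.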
