CLAIApr 6, 2022

DAGAM: Data Augmentation with Generation And Modification

arXiv:2204.02633v12 citationsh-index: 5
Originality Synthesis-oriented
AI Analysis

This work addresses data scarcity issues in text classification, but it is incremental as it combines existing generation and modification techniques.

The authors tackled underfitting in large pre-trained language models for text classification by proposing three data augmentation schemes: generation (DAG), modification (DAM), and their combination (DAGAM), which improved performance on six benchmark datasets compared to using original datasets.

Text classification is a representative downstream task of natural language processing, and has exhibited excellent performance since the advent of pre-trained language models based on Transformer architecture. However, in pre-trained language models, under-fitting often occurs due to the size of the model being very large compared to the amount of available training data. Along with significant importance of data collection in modern machine learning paradigm, studies have been actively conducted for natural language data augmentation. In light of this, we introduce three data augmentation schemes that help reduce underfitting problems of large-scale language models. Primarily we use a generation model for data augmentation, which is defined as Data Augmentation with Generation (DAG). Next, we augment data using text modification techniques such as corruption and word order change (Data Augmentation with Modification, DAM). Finally, we propose Data Augmentation with Generation And Modification (DAGAM), which combines DAG and DAM techniques for a boosted performance. We conduct data augmentation for six benchmark datasets of text classification task, and verify the usefulness of DAG, DAM, and DAGAM through BERT-based fine-tuning and evaluation, deriving better results compared to the performance with original datasets.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes