CL · Apr 21, 2019

Good-Enough Compositional Data Augmentation

arXiv:1904.09545v4 · 1105 citations

Originality: Incremental advance

AI Analysis

This paper addresses compositional generalization, a challenge for NLP researchers and practitioners, and offers a practical, incremental improvement applicable across a variety of tasks.

The paper aims to improve compositional generalization in sequence models with a simple, model-agnostic data augmentation protocol: synthetic examples are built by replacing fragments of real training examples with other fragments that appear in at least one similar environment. Reported results include error rate reductions of up to 87% on SCAN diagnostic tasks and 16% on a semantic parsing task for neural sequence-to-sequence models, along with a roughly 1% perplexity reduction for n-gram language models on small corpora in several languages.

We propose a simple data augmentation protocol aimed at providing a compositional inductive bias in conditional and unconditional sequence models. Under this protocol, synthetic training examples are constructed by taking real training examples and replacing (possibly discontinuous) fragments with other fragments that appear in at least one similar environment. The protocol is model-agnostic and useful for a variety of tasks. Applied to neural sequence-to-sequence models, it reduces error rate by as much as 87% on diagnostic tasks from the SCAN dataset and 16% on a semantic parsing task. Applied to n-gram language models, it reduces perplexity by roughly 1% on small corpora in several languages.
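The substitution rule described in the abstract can be illustrated with a small sketch. This is not the paper's implementation. It makes several simplifying assumptions: fragments are contiguous spans only (the paper also allows discontinuous ones), the data is unconditional text rather than input-output pairs, and an "environment" is taken to be the tokens immediately to the left and right of a fragment. The function name `augment` and the parameters `max_frag_len` and `window` are hypothetical, chosen only for this example.

```python
from collections import defaultdict

PAD = "<pad>"  # hypothetical boundary marker for sentence edges


def augment(sentences, max_frag_len=2, window=1):
    """Return synthetic sentences built by swapping fragments that
    share at least one environment (simplified: contiguous spans only)."""
    tokenized = [s.split() for s in sentences]
    envs_of = defaultdict(set)  # fragment -> environments it occurs in
    occurrences = []            # (sentence index, start, end, fragment)

    for idx, toks in enumerate(tokenized):
        padded = [PAD] * window + toks + [PAD] * window
        for start in range(len(toks)):
            last = min(start + max_frag_len, len(toks))
            for end in range(start + 1, last + 1):
                frag = tuple(toks[start:end])
                # Environment = `window` tokens on each side of the span.
                left = tuple(padded[start:start + window])
                right = tuple(padded[end + window:end + 2 * window])
                envs_of[frag].add((left, right))
                occurrences.append((idx, start, end, frag))

    frags_in = defaultdict(set)  # environment -> fragments seen in it
    for frag, envs in envs_of.items():
        for env in envs:
            frags_in[env].add(frag)

    seen = {tuple(toks) for toks in tokenized}
    synthetic = set()
    for idx, start, end, frag in occurrences:
        toks = tokenized[idx]
        # Candidates share at least one environment with `frag`.
        candidates = set()
        for env in envs_of[frag]:
            candidates |= frags_in[env]
        candidates.discard(frag)
        for other in candidates:
            new = tuple(toks[:start]) + other + tuple(toks[end:])
            if new not in seen:  # keep only genuinely new sentences
                synthetic.add(new)
    return sorted(" ".join(t) for t in synthetic)


train = ["the cat sings", "the wug sings", "the cat dances"]
print(augment(train))  # prints ['the wug dances']
```

In this toy corpus, "cat" and "wug" both occur between "the" and "sings", so every other occurrence of "cat" becomes a substitution site for "wug", which yields the unseen sentence "the wug dances". A practical version would likely need frequency thresholds or sampling to limit the number of generated examples.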
