Learning to Recombine and Resample Data for Compositional Generalization
This paper addresses the challenge of compositional generalization in neural language-processing models. It offers a method that improves performance without relying on symbolic scaffolding, though the contribution is incremental in that it builds on existing data augmentation techniques.
The paper tackles poor compositional generalization in neural sequence models by introducing R&R, a learned data augmentation scheme that recombines and resamples training data. With R&R, models learn new constructions and tenses from as few as eight examples on SCAN and SIGMORPHON 2018.
Flexible neural sequence models outperform grammar- and automaton-based counterparts on a variety of tasks. However, neural models perform poorly in settings requiring compositional generalization beyond the training data -- particularly to rare or unseen subsequences. Past work has found symbolic scaffolding (e.g. grammars or automata) essential in these settings. We describe R&R, a learned data augmentation scheme that enables a large category of compositional generalizations without appeal to latent symbolic structure. R&R has two components: recombination of original training examples via a prototype-based generative model and resampling of generated examples to encourage extrapolation. Training an ordinary neural sequence model on a dataset augmented with recombined and resampled examples significantly improves generalization in two language processing problems -- instruction following (SCAN) and morphological analysis (SIGMORPHON 2018) -- where R&R enables learning of new constructions and tenses from as few as eight initial examples.
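To make the two components concrete, here is a minimal Python sketch under strong simplifying assumptions. It is not the paper's method: R&R learns a prototype-based generative model, whereas this toy swaps one token between two training sequences at the same position, and it handles only input-side token sequences rather than aligned input/output pairs. The function names (`recombine`, `rarity_weight`, `augment`), the inverse-count weighting, and the miniature SCAN-like data are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def recombine(prototype, donor, rng):
    """Toy recombination: copy the prototype, then overwrite one position
    with the donor's token at that same position."""
    tokens = list(prototype)
    i = rng.randrange(min(len(prototype), len(donor)))
    tokens[i] = donor[i]
    return tuple(tokens)


def rarity_weight(example, counts):
    """Toy resampling weight (an assumption, not the paper's formula):
    the inverse training count of the example's rarest token."""
    return 1.0 / min(counts[tok] for tok in example)


def augment(train, n_candidates=200, n_keep=20, seed=0):
    """Generate novel recombinations, then resample them so that
    examples containing rare tokens are drawn more often."""
    rng = random.Random(seed)
    counts = Counter(tok for ex in train for tok in ex)
    seen = set(train)

    candidates = []
    for _ in range(n_candidates):
        prototype, donor = rng.sample(train, 2)
        new = recombine(prototype, donor, rng)
        if new not in seen:  # keep only sequences absent from training
            candidates.append(new)
    if not candidates:
        return []

    weights = [rarity_weight(c, counts) for c in candidates]
    return rng.choices(candidates, weights=weights, k=n_keep)


# Hypothetical SCAN-like inputs: "jump" occurs in a single training example.
train = [
    ("walk", "twice"), ("walk", "left"), ("walk", "right"),
    ("run", "twice"), ("run", "left"),
    ("look", "left"), ("look", "right"),
    ("jump", "left"),
]
augmented = augment(train)
```

In this toy data, `jump` appears in only one training example, so novel sequences containing it receive twice the sampling weight of other novel sequences and tend to be drawn more often into the augmented set. An ordinary sequence model would then be trained on `train` plus `augmented`.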