CL
Apr 28, 2020

Unnatural Language Processing: Bridging the Gap Between Synthetic and Natural Language Data

arXiv:2004.13645v1
33 citations
AI Analysis

This paper addresses the data collection bottleneck for NLP developers with a practical framework that could reduce reliance on costly human annotation, though the contribution appears incremental: it adapts existing transfer methods to language.

The paper tackles the challenge of needing large human-annotated datasets for NLP by introducing a simulation-to-real transfer technique that uses only synthetic training data to interpret natural utterances, matching or outperforming state-of-the-art models trained on natural data in several domains.

Large, human-annotated datasets are central to the development of natural language processing models. Collecting these datasets can be the most challenging part of the development process. We address this problem by introducing a general purpose technique for "simulation-to-real" transfer in language understanding problems with a delimited set of target behaviors, making it possible to develop models that can interpret natural utterances without natural training data. We begin with a synthetic data generation procedure, and train a model that can accurately interpret utterances produced by the data generator. To generalize to natural utterances, we automatically find projections of natural language utterances onto the support of the synthetic language, using learned sentence embeddings to define a distance metric. With only synthetic training data, our approach matches or outperforms state-of-the-art models trained on natural language data in several domains. These results suggest that simulation-to-real transfer is a practical framework for developing NLP applications, and that improved models for transfer might provide wide-ranging improvements in downstream tasks.
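The projection step described in the abstract can be illustrated with a minimal sketch. It assumes the simplest possible reading: map a natural utterance to its nearest neighbor among a small, enumerable set of synthetic utterances, then reuse the interpretation attached to that synthetic utterance. The bag-of-words `embed` function is a toy stand-in for the learned sentence embeddings the abstract mentions, and `SYNTHETIC_TO_ACTION`, `project`, and `interpret` are hypothetical names chosen for illustration, not the authors' implementation.

```python
import math
from collections import Counter

def embed(sentence):
    """Toy stand-in for a learned sentence embedding: a bag-of-words count vector."""
    return Counter(sentence.lower().split())

def cosine_distance(a, b):
    """Cosine distance between two sparse count vectors (1.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def project(natural, synthetic_utterances):
    """Return the synthetic utterance closest to the natural one under the embedding distance."""
    query = embed(natural)
    return min(synthetic_utterances, key=lambda s: cosine_distance(query, embed(s)))

# Hypothetical synthetic language: each generated utterance has a known interpretation.
SYNTHETIC_TO_ACTION = {
    "turn on the light": "light_on",
    "turn off the light": "light_off",
    "play some music": "play_music",
}

def interpret(natural):
    """Project a natural utterance onto the synthetic language, then look up its meaning."""
    nearest = project(natural, list(SYNTHETIC_TO_ACTION))
    return SYNTHETIC_TO_ACTION[nearest]

print(interpret("please turn on the kitchen light"))
print(interpret("could you play music"))
```

In this sketch the lookup table plays the role of the model trained on synthetic data. A realistic synthetic language would be too large to enumerate, so the exhaustive `min` over all utterances would have to be replaced by some form of search; the details of how the paper does this are not given in the abstract.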

Code Implementations
1 repo
