CL · May 6, 2020

Shape of synth to come: Why we should use synthetic data for English surface realization

arXiv:2005.02693v1997 citations
AI Analysis

This paper addresses data scarcity in surface realization, a challenge for NLP researchers, and argues for a policy change that would encourage the use of synthetic data. The contribution is incremental, since it builds on existing methods and datasets.

The paper tackles surface realization in Natural Language Generation. It shows that synthetic data improves a previously state-of-the-art system by almost 8 BLEU points on the English 2018 dataset, contrary to the 2018 shared task findings that led to synthetic data being prohibited in the 2019 shared task.

The Surface Realization Shared Tasks of 2018 and 2019 were Natural Language Generation shared tasks with the goal of exploring approaches to surface realization from Universal-Dependency-like trees to surface strings for several languages. In the 2018 shared task there was very little difference in the absolute performance of systems trained with and without additional, synthetically created data, and a new rule prohibiting the use of synthetic data was introduced for the 2019 shared task. Contrary to the findings of the 2018 shared task, we show, in experiments on the English 2018 dataset, that the use of synthetic data can have a substantial positive effect - an improvement of almost 8 BLEU points for a previously state-of-the-art system. We analyse the effects of synthetic data, and we argue that its use should be encouraged rather than prohibited so that future research efforts continue to explore systems that can take advantage of such data.

Code Implementations: 1 repo