CL · May 6, 2020

Shape of synth to come: Why we should use synthetic data for English surface realization

Henry Elder, Robert Burke, Alexander O'Connor, Jennifer Foster

arXiv:2005.02693v1 · 31.0997 citations · h-index: 30 · Has Code

Originality: Incremental advance

AI Analysis

This paper addresses data scarcity in surface realization and argues for a policy change in shared tasks to encourage the use of synthetic data. The contribution is incremental, as it builds on existing methods and datasets.

The paper addresses surface realization in Natural Language Generation, showing that synthetic data improves a previously state-of-the-art system by almost 8 BLEU points on the English 2018 dataset. This contradicts the findings of the 2018 shared task, which led to synthetic data being prohibited in the 2019 shared task.

The Surface Realization Shared Tasks of 2018 and 2019 were Natural Language Generation shared tasks with the goal of exploring approaches to surface realization from Universal-Dependency-like trees to surface strings for several languages. In the 2018 shared task there was very little difference in the absolute performance of systems trained with and without additional, synthetically created data, and a new rule prohibiting the use of synthetic data was introduced for the 2019 shared task. Contrary to the findings of the 2018 shared task, we show, in experiments on the English 2018 dataset, that the use of synthetic data can have a substantial positive effect - an improvement of almost 8 BLEU points for a previously state-of-the-art system. We analyse the effects of synthetic data, and we argue that its use should be encouraged rather than prohibited so that future research efforts continue to explore systems that can take advantage of such data.
