LG AI PL ML

Dec 27, 2019

Synthetic Datasets for Neural Program Synthesis

Richard Shin, Neel Kant, Kavi Gupta, Christopher Bender, Brandon Trabucco, Rishabh Singh, Dawn Song

arXiv:1912.12345v115.345 citations

Originality: Incremental advance

AI Analysis

This paper addresses generalization issues in neural program synthesis and is aimed at researchers working with domain-specific languages. It appears incremental because it builds on existing test input generation techniques.

The paper tackles poor generalization in neural program synthesis when models are trained on synthetic datasets for languages with control flow and rich input spaces. It demonstrates, in two domain-specific languages, that training deep networks on carefully controlled synthetic data distributions improves cross-distribution generalization.

The goal of program synthesis is to automatically generate programs in a particular language from corresponding specifications, e.g. input-output behavior. Many current approaches achieve impressive results after training on randomly generated I/O examples in limited domain-specific languages (DSLs), as with string transformations in RobustFill. However, we empirically discover that applying test input generation techniques for languages with control flow and rich input space causes deep networks to generalize poorly to certain data distributions; to correct this, we propose a new methodology for controlling and evaluating the bias of synthetic data distributions over both programs and specifications. We demonstrate, using the Karel DSL and a small Calculator DSL, that training deep networks on these distributions leads to improved cross-distribution generalization performance.
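To make the idea of controlling the bias of a synthetic data distribution concrete, the following is a minimal sketch and not the paper's method. It assumes a toy arithmetic expression language (not the paper's Calculator DSL), a single hand-picked salient feature (the number of operators in a program), and plain rejection sampling to make that feature uniform across the dataset. All function names are hypothetical.

```python
import random

OPS = ["+", "-", "*"]


def random_expr(rng, max_depth=3):
    """Sample a random expression: an int leaf or a nested (op, left, right) tuple."""
    if max_depth == 0 or rng.random() < 0.3:
        return rng.randint(0, 9)
    op = rng.choice(OPS)
    return (op, random_expr(rng, max_depth - 1), random_expr(rng, max_depth - 1))


def evaluate(expr):
    """Compute the output of an expression; the output serves as the specification."""
    if isinstance(expr, int):
        return expr
    op, left, right = expr
    a, b = evaluate(left), evaluate(right)
    if op == "+":
        return a + b
    if op == "-":
        return a - b
    return a * b


def num_ops(expr):
    """Salient feature (an assumption for this sketch): number of operators."""
    if isinstance(expr, int):
        return 0
    return 1 + num_ops(expr[1]) + num_ops(expr[2])


def sample_uniform_over_feature(rng, feature, values, per_value):
    """Rejection-sample programs until every feature value has `per_value` examples."""
    buckets = {v: [] for v in values}
    while any(len(b) < per_value for b in buckets.values()):
        expr = random_expr(rng)
        v = feature(expr)
        # Keep the sample only if its bucket still has room; otherwise discard it.
        if v in buckets and len(buckets[v]) < per_value:
            buckets[v].append((expr, evaluate(expr)))
    return buckets


rng = random.Random(0)
dataset = sample_uniform_over_feature(rng, num_ops, values=range(0, 4), per_value=50)
counts = {v: len(examples) for v, examples in dataset.items()}
```

Without the rejection step, the naive sampler over-represents some operator counts and under-represents others. Comparing a model trained on the naive distribution with one trained on the flattened distribution is one simple way to probe cross-distribution generalization under these assumptions.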
