Learning to Synthesize Data for Semantic Parsing
This work addresses data scarcity for semantic parsing, offering a method to generate more diverse training data without handcrafted rules, though it is incremental as it builds on existing PCFG and BART techniques.
The paper tackles the problem of synthesizing diverse data for semantic parsing. It proposes a generative model that combines a PCFG for program composition with a BART-based translation model; data synthesized from this model improves compositional and domain generalization in text-to-SQL parsing on the GeoQuery and Spider benchmarks.
Synthesizing data for semantic parsing has gained increasing attention recently. However, most methods require handcrafted (high-precision) rules in their generative process, hindering the exploration of diverse unseen data. In this work, we propose a generative model which features a (non-neural) PCFG that models the composition of programs (e.g., SQL), and a BART-based translation model that maps a program to an utterance. Due to the simplicity of PCFG and pre-trained BART, our generative model can be efficiently learned from existing data at hand. Moreover, explicitly modeling compositions using PCFG leads to a better exploration of unseen programs, and thus generates more diverse data. We evaluate our method in both in-domain and out-of-domain settings of text-to-SQL parsing on the standard benchmarks of GeoQuery and Spider, respectively. Our empirical results show that the synthesized data generated from our model can substantially help a semantic parser achieve better compositional and domain generalization.
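To make the PCFG half of the generative model concrete, the following is a minimal sketch, not the paper's implementation. It estimates rule probabilities by relative frequency from a handful of derivations and then samples new programs top-down. The toy grammar, the symbol names (`Q`, `COL`, `TAB`, `COND`, `VAL`), and the helper functions `estimate_pcfg` and `sample_program` are all hypothetical, chosen only for illustration. The BART-based program-to-utterance step is omitted.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy derivations over a SQL-like grammar (an assumption for
# illustration, not the grammar used in the paper). Each program is listed
# as the production rules (lhs, rhs) that appear in its derivation.
TOY_DERIVATIONS = [
    [
        ("Q", ("SELECT", "COL", "FROM", "TAB")),
        ("COL", ("name",)),
        ("TAB", ("city",)),
    ],
    [
        ("Q", ("SELECT", "COL", "FROM", "TAB", "WHERE", "COND")),
        ("COL", ("population",)),
        ("TAB", ("state",)),
        ("COND", ("COL", "=", "VAL")),
        ("COL", ("name",)),
        ("VAL", ("'texas'",)),
    ],
]


def estimate_pcfg(derivations):
    """Relative-frequency estimate of P(rhs | lhs) from observed rules."""
    counts = defaultdict(Counter)
    for derivation in derivations:
        for lhs, rhs in derivation:
            counts[lhs][rhs] += 1
    return {
        lhs: {rhs: n / sum(rhs_counts.values()) for rhs, n in rhs_counts.items()}
        for lhs, rhs_counts in counts.items()
    }


def sample_program(pcfg, symbol="Q", rng=random):
    """Expand `symbol` top-down; symbols without rules are terminals."""
    if symbol not in pcfg:
        return [symbol]
    rules = list(pcfg[symbol])
    weights = [pcfg[symbol][rhs] for rhs in rules]
    rhs = rng.choices(rules, weights=weights, k=1)[0]
    tokens = []
    for child in rhs:
        tokens.extend(sample_program(pcfg, child, rng))
    return tokens


pcfg = estimate_pcfg(TOY_DERIVATIONS)
program = sample_program(pcfg, rng=random.Random(0))
print(" ".join(program))
```

Because each nonterminal is expanded independently, the sampler can recombine observed fragments into programs that never occurred in the training derivations (for example, selecting `name` from `state`). This is the sense in which explicitly modeling composition encourages exploration of unseen programs. In the full pipeline described above, each sampled program would then be passed to the translation model to obtain a paired utterance.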