Controllable Data Augmentation for Context-Dependent Text-to-SQL
This work addresses data scarcity for text-to-SQL systems, particularly for complex, context-dependent queries. The contribution is incremental, since it builds on existing augmentation methods.
The paper tackles the limited diversity of augmented data for context-dependent text-to-SQL models by introducing ConDA, which generates interactive questions and corresponding SQL results using SQL dialogue state transitions, with a filter for quality control. ConDA yields an average improvement of 3.3% on complex questions on the SParC and CoSQL datasets.
The limited scale of annotated data constrains existing context-dependent text-to-SQL models because labeling is complex. Data augmentation is commonly used to address this problem. However, the data generated by current augmentation methods often lack diversity. In this paper, we introduce ConDA, which generates interactive questions and corresponding SQL results. We design the SQL dialogue state to enhance data diversity through state transitions. Meanwhile, we present a filter method to ensure data quality: a grounding model identifies and filters low-quality questions that mismatch the state information. Experimental results on the SParC and CoSQL datasets show that ConDA boosts the baseline model to achieve an average improvement of $3.3\%$ on complex questions. Moreover, our analysis of the augmented data reveals that the data generated by ConDA are of high quality in terms of SQL template hardness and types, turns, and question consistency.
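To make the two ideas in the abstract concrete, the following is a minimal illustrative sketch, not ConDA's actual implementation. It assumes a dialogue state can be modeled as the parts of a SQL query built so far, that a state transition edits one part per turn, and that the grounding model can be approximated by a simple keyword-overlap score. The names `SQLState`, `add_condition`, `add_order`, `grounding_score`, and `keep`, along with the `singer` table and the 0.5 threshold, are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class SQLState:
    """Hypothetical dialogue state: the pieces of the SQL query built so far."""
    table: str
    select: Tuple[str, ...]
    where: Tuple[str, ...] = ()
    order_by: Optional[str] = None

    def to_sql(self) -> str:
        """Render the current state as a SQL string."""
        sql = f"SELECT {', '.join(self.select)} FROM {self.table}"
        if self.where:
            sql += " WHERE " + " AND ".join(self.where)
        if self.order_by:
            sql += f" ORDER BY {self.order_by}"
        return sql


# Each transition returns a new state, so earlier turns stay intact.
def add_condition(state: SQLState, condition: str) -> SQLState:
    return replace(state, where=state.where + (condition,))


def add_order(state: SQLState, column: str) -> SQLState:
    return replace(state, order_by=column)


def grounding_score(question: str, mentions: List[str]) -> float:
    """Toy stand-in for a grounding model (an assumption, not the paper's model):
    the fraction of expected schema mentions that appear in the question."""
    if not mentions:
        return 1.0
    text = question.lower()
    hits = sum(1 for mention in mentions if mention.lower() in text)
    return hits / len(mentions)


def keep(question: str, mentions: List[str], threshold: float = 0.5) -> bool:
    """Keep a generated question only if it matches the state information."""
    return grounding_score(question, mentions) >= threshold


# A three-turn interaction produced by two state transitions.
s0 = SQLState(table="singer", select=("name",))
s1 = add_condition(s0, "age > 30")
s2 = add_order(s1, "age")

for state in (s0, s1, s2):
    print(state.to_sql())

# Candidate questions for the second turn, which introduced the "age" condition.
candidates = ["Only those with age above 30.", "Show their countries instead."]
kept = [q for q in candidates if keep(q, ["age"])]
print(kept)
```

In this sketch, diversity would come from sampling different transition sequences over the same starting state, while the filter discards generated questions that do not mention what the transition changed. A learned grounding model would replace the keyword overlap, which rejects valid paraphrases such as "older than 30".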