CL · Jan 8

EvolSQL: Structure-Aware Evolution for Scalable Text-to-SQL Data Synthesis

arXiv:2601.04875v1 · h-index: 3
Originality: Incremental advance
AI Analysis

The paper addresses the limited availability of training data for Text-to-SQL models. The advance is incremental: it builds on existing synthesis methods by adding explicit structural control.

The paper tackles the scarcity of high-quality, diverse, and structurally complex datasets for training Text-to-SQL models. It introduces EvolSQL, a structure-aware data synthesis framework that evolves SQL queries to increase their diversity and complexity. A 7B model fine-tuned on the resulting data outperforms one trained on the much larger SynSQL dataset while using only 1/18 as much data.

Training effective Text-to-SQL models remains challenging due to the scarcity of high-quality, diverse, and structurally complex datasets. Existing methods either rely on limited human-annotated corpora, or synthesize datasets directly by simply prompting LLMs without explicit control over SQL structures, often resulting in limited structural diversity and complexity. To address this, we introduce EvolSQL, a structure-aware data synthesis framework that evolves SQL queries from seed data into richer and more semantically diverse forms. EvolSQL starts with an exploratory Query-SQL expansion to broaden question diversity and improve schema coverage, and then applies an adaptive directional evolution strategy using six atomic transformation operators derived from the SQL Abstract Syntax Tree to progressively increase query complexity across relational, predicate, aggregation, and nesting dimensions. An execution-grounded SQL refinement module and schema-aware deduplication further ensure the creation of high-quality, structurally diverse mapping pairs. Experimental results show that a 7B model fine-tuned on our data outperforms one trained on the much larger SynSQL dataset using only 1/18 of the data.
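The abstract's last two pipeline stages, execution-grounded refinement and deduplication, can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names (`executes`, `structure_key`, `filter_pairs`), the toy `singer` schema, and the regex-based structural signature are all assumptions made for illustration. The sketch only discards queries that fail to run, whereas the paper describes a refinement module, and the literal-masking key is a crude stand-in for schema-aware deduplication.

```python
import re
import sqlite3


def executes(conn: sqlite3.Connection, sql: str) -> bool:
    """Return True if the query runs against the schema without error."""
    try:
        conn.execute(sql).fetchall()
        return True
    except sqlite3.Error:
        return False


def structure_key(sql: str) -> str:
    """Crude structural signature: mask literals, normalize case and spacing."""
    masked = re.sub(r"'[^']*'", "?", sql)              # string literals
    masked = re.sub(r"\b\d+(\.\d+)?\b", "?", masked)   # numeric literals
    return re.sub(r"\s+", " ", masked).strip().lower()


def filter_pairs(conn, pairs):
    """Keep executable pairs, dropping later ones whose structure repeats."""
    seen, kept = set(), []
    for question, sql in pairs:
        if not executes(conn, sql):
            continue  # fails the execution check
        key = structure_key(sql)
        if key in seen:
            continue  # same structure as an earlier pair
        seen.add(key)
        kept.append((question, sql))
    return kept


# Hypothetical toy database, used only to ground the execution check.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (id INTEGER, name TEXT, age INTEGER)")
conn.execute("INSERT INTO singer VALUES (1, 'Ann', 31), (2, 'Bo', 24)")

candidates = [
    ("Which singers are older than 30?", "SELECT name FROM singer WHERE age > 30"),
    ("Which singers are older than 25?", "SELECT name FROM singer WHERE age > 25"),
    ("How many singers are there?", "SELECT COUNT(*) FROM singer"),
    ("List singer countries.", "SELECT country FROM singer"),
]
kept = filter_pairs(conn, candidates)
```

Here the second candidate is dropped because it differs from the first only in a literal, and the fourth is dropped because it references a column that does not exist in the schema. A parser-based signature over the SQL syntax tree would be a more faithful notion of structure than regex masking.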

