CL · IR · May 13

Knowledge Distillation for Low-Resource Open-source Text-to-SQL Model

arXiv:2605.22843 · 97.8 · Has Code

Predicted impact: top 4% in CL · last 90 days
Originality: Incremental advance

AI Analysis

For practitioners needing Text-to-SQL in low-resource domains, this framework improves generalization and robustness by injecting domain knowledge into training and inference.

The paper proposes a knowledge-aware Text-to-SQL framework that constructs a task-specific knowledge base (schema semantics, abbreviations, business logic, query patterns) to generate diverse synthetic training data and enhance inference via knowledge retrieval. Experiments on seven benchmarks show substantial performance improvements for both open-source and closed-source LLMs, especially in low-resource domain-specific settings.

Text-to-SQL converts natural language questions into executable SQL queries, enabling non-technical users to access relational databases for analytics and intelligent data services. In real-world scenarios, performance is often constrained by low-resource settings, where high-quality annotated <question, SQL> pairs are scarce, particularly for domain-specific databases. Additional challenges include opaque schema definitions, abbreviations, and implicit business logic that is not explicitly encoded in the schema. Existing data synthesis and prompting techniques improve coverage but often fail to produce task-specific, semantically grounded examples aligned with database constraints. To address these challenges, we propose a knowledge-aware Text-to-SQL framework that constructs a task-specific knowledge base, including schema semantics, abbreviations, business logic, and query patterns, and injects this knowledge into both training and inference. The framework generates diverse, contextually grounded synthetic training data and enhances inference through targeted knowledge retrieval. Experiments on seven benchmarks, covering both general and domain-specific datasets, demonstrate that our approach substantially improves the performance of open-source and closed-source large language models on Text-to-SQL tasks, especially in low-resource domain-specific settings, enhancing generalization, robustness, and adaptability.
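To make the inference-time idea concrete, the following is a minimal sketch of knowledge retrieval feeding a Text-to-SQL prompt. It is not the paper's implementation: the abstract does not say how entries are stored or ranked, so the `KnowledgeEntry` type, the token-overlap scoring, the stopword list, the prompt layout, and the example `orders` schema are all assumptions made for illustration.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Hypothetical stopword list so that filler words do not count as matches.
STOPWORDS = {"a", "an", "the", "is", "in", "of", "for", "at", "by", "per", "what"}


@dataclass(frozen=True)
class KnowledgeEntry:
    # kind mirrors the four knowledge types named in the abstract:
    # "schema", "abbreviation", "business_logic", "query_pattern".
    kind: str
    text: str


def tokenize(s: str) -> set[str]:
    """Lowercase word tokens with stopwords removed."""
    return set(re.findall(r"[a-z0-9_]+", s.lower())) - STOPWORDS


def retrieve(question: str, kb: list[KnowledgeEntry], k: int = 2) -> list[KnowledgeEntry]:
    """Return up to k entries that share at least one token with the question.

    Ranking is plain token overlap (an assumption, not the paper's method);
    ties keep knowledge-base order.
    """
    q = tokenize(question)
    scored = [(len(q & tokenize(e.text)), i, e) for i, e in enumerate(kb)]
    scored = [t for t in scored if t[0] > 0]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [e for _, _, e in scored[:k]]


def build_prompt(question: str, schema: str, kb: list[KnowledgeEntry], k: int = 2) -> str:
    """Assemble a prompt that places the retrieved knowledge next to the schema."""
    lines = ["-- Schema", schema, "-- Retrieved knowledge"]
    lines += [f"-- [{e.kind}] {e.text}" for e in retrieve(question, kb, k)]
    lines += ["-- Question: " + question, "SELECT"]
    return "\n".join(lines)


# Illustrative knowledge base for an invented orders table.
KB = [
    KnowledgeEntry("abbreviation", "GMV means gross merchandise value, stored in orders.gmv_amt"),
    KnowledgeEntry("business_logic", "An active customer placed at least one order in the last 30 days"),
    KnowledgeEntry("query_pattern", "Monthly totals: GROUP BY strftime('%Y-%m', order_date)"),
]

if __name__ == "__main__":
    print(build_prompt("Show monthly GMV for 2024",
                       "orders(order_id, order_date, gmv_amt)", KB))
```

For the question "Show monthly GMV for 2024", this sketch retrieves the abbreviation entry (matching "gmv") and the query-pattern entry (matching "monthly") and skips the unrelated business-logic entry. A real system would likely swap the overlap score for embedding similarity or a learned retriever; the same knowledge base could also seed synthetic <question, SQL> pairs for training, which this sketch does not attempt.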
