CL | Aug 6, 2024

Synthesizing Text-to-SQL Data from Weak and Strong LLMs

arXiv:2408.03256v1 | 85 citations | h-index: 23 | Has Code
Originality: Incremental advance
AI Analysis

This work aims to improve text-to-SQL performance for open-source models. The advance is incremental: it builds on existing synthetic data and preference learning methods.

The paper tackles the performance gap between open-source and closed-source LLMs in text-to-SQL tasks by introducing a synthetic data approach that combines outputs from strong and weak models, resulting in SENSE, a specialized model that achieves state-of-the-art results on SPIDER and BIRD benchmarks.

The capability gap between open-source and closed-source large language models (LLMs) remains a challenge in text-to-SQL tasks. In this paper, we introduce a synthetic data approach that combines data produced by larger, more powerful models (strong models) with error information data generated by smaller, not well-aligned models (weak models). The method not only enhances the domain generalization of text-to-SQL models but also explores the potential of error data supervision through preference learning. Furthermore, we employ the synthetic data approach for instruction tuning on open-source LLMs, resulting in SENSE, a specialized text-to-SQL model. The effectiveness of SENSE is demonstrated through state-of-the-art results on the SPIDER and BIRD benchmarks, bridging the performance gap between open-source models and methods prompted by closed-source models.
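To make the idea of error data supervision concrete, the following is a minimal sketch of how weak-model outputs could be turned into preference pairs by executing them against a database. It is an illustration under stated assumptions, not the paper's pipeline: the function names `run_query` and `build_preference_pairs`, the toy `singer` schema, the hard-coded candidate queries (standing in for weak-model samples), and the choice of the gold query as the preferred answer are all hypothetical.

```python
import sqlite3


def run_query(conn, sql):
    """Return the result rows in a canonical order, or None if execution fails."""
    try:
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except sqlite3.Error:
        return None


def build_preference_pairs(conn, question, gold_sql, weak_candidates):
    """Pair the gold query (chosen) with each candidate whose execution
    result differs from the gold result (rejected).

    Assumption of this sketch: correctness is judged by execution result,
    and the gold query serves as the preferred answer.
    """
    gold_rows = run_query(conn, gold_sql)
    if gold_rows is None:
        return []  # an unusable gold query yields no supervision
    pairs = []
    for candidate in weak_candidates:
        # Failed executions return None, which never equals the gold rows.
        if run_query(conn, candidate) != gold_rows:
            pairs.append(
                {"prompt": question, "chosen": gold_sql, "rejected": candidate}
            )
    return pairs


# Toy in-memory database standing in for a benchmark schema (hypothetical).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    INSERT INTO singer VALUES (1, 'Ann', 30), (2, 'Bob', 45), (3, 'Cy', 52);
    """
)

question = "How many singers are older than 40?"
gold_sql = "SELECT COUNT(*) FROM singer WHERE age > 40"

# Stand-ins for samples drawn from a weak model (hypothetical).
weak_candidates = [
    "SELECT COUNT(*) FROM singer WHERE age > 40",   # matches the gold result
    "SELECT COUNT(*) FROM singer WHERE age >= 30",  # runs, but wrong result
    "SELECT COUNT(*) FROM singers WHERE age > 40",  # fails: no such table
    "SELECT COUNT(id) FROM singer WHERE age > 40",  # different text, same result
]

pairs = build_preference_pairs(conn, question, gold_sql, weak_candidates)
for pair in pairs:
    print(pair["rejected"])
```

Records shaped as `prompt` / `chosen` / `rejected` are a common input format for preference-learning trainers. Comparing execution results rather than query text keeps semantically equivalent candidates, such as the `COUNT(id)` variant, from being labelled as errors.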

