CL | Aug 6, 2024

Synthesizing Text-to-SQL Data from Weak and Strong LLMs

Jiaxi Yang, Binyuan Hui, Min Yang, Jian Yang, Junyang Lin, Chang Zhou

arXiv:2408.03256v124.291 citations | h-index: 42 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of improving text-to-SQL performance for open-source models. The advance is incremental because it builds on existing synthetic data and preference learning methods.

The paper tackles the performance gap between open-source and closed-source LLMs in text-to-SQL tasks by introducing a synthetic data approach that combines outputs from strong and weak models, resulting in SENSE, a specialized model that achieves state-of-the-art results on SPIDER and BIRD benchmarks.

The capability gap between open-source and closed-source large language models (LLMs) remains a challenge in text-to-SQL tasks. In this paper, we introduce a synthetic data approach that combines data produced by larger, more powerful models (strong models) with error information data generated by smaller, not well-aligned models (weak models). The method not only enhances the domain generalization of text-to-SQL models but also explores the potential of error data supervision through preference learning. Furthermore, we employ the synthetic data approach for instruction tuning on open-source LLMs, resulting in SENSE, a specialized text-to-SQL model. The effectiveness of SENSE is demonstrated through state-of-the-art results on the SPIDER and BIRD benchmarks, bridging the performance gap between open-source models and methods prompted by closed-source models.
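To make the idea of error data supervision more concrete, here is a minimal sketch of how weak-model mistakes could be turned into preference pairs. It is an illustration under stated assumptions, not the paper's pipeline: the abstract does not say how errors are detected, so the sketch assumes a candidate counts as an error when it fails to execute or returns different rows than a reference query. The helper names (`run_query`, `build_preference_pairs`), the `prompt`/`chosen`/`rejected` record layout, and the toy `singer` table are all hypothetical.

```python
import sqlite3


def run_query(conn, sql):
    """Return the result rows in a canonical order, or None if execution fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Sort so that row order alone does not make two results look different.
    return sorted(rows, key=repr)


def build_preference_pairs(conn, question, reference_sql, weak_candidates):
    """Pair the reference SQL (chosen) with each erroneous weak-model SQL (rejected).

    Assumption: "erroneous" means the candidate fails to run or its rows
    differ from those of the reference query on this database.
    """
    reference_rows = run_query(conn, reference_sql)
    if reference_rows is None:
        return []  # no usable reference, so nothing to compare against
    pairs = []
    for candidate in weak_candidates:
        if run_query(conn, candidate) != reference_rows:
            pairs.append(
                {"prompt": question, "chosen": reference_sql, "rejected": candidate}
            )
    return pairs


# Toy database and candidates, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ann", 30), ("Bob", 45)])

pairs = build_preference_pairs(
    conn,
    "How many singers are older than 40?",
    "SELECT COUNT(*) FROM singer WHERE age > 40",
    [
        "SELECT COUNT(*) FROM singer WHERE age >= 40",  # same rows on this data: kept out
        "SELECT COUNT(*) FROM singer",                  # wrong count: rejected
        "SELECT COUNT(*) FROM singers WHERE age > 40",  # unknown table: rejected
    ],
)
```

Records of this shape could then be passed to a preference-learning objective. Note that execution-based comparison is only a proxy: the first candidate is logically different from the reference but is not flagged because the toy data happens to give the same result.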
