CL, AI, LG, SE · Feb 23, 2025

SQLong: Enhanced NL2SQL for Longer Contexts with LLMs

arXiv:2502.16747v2 · 5 citations · h-index: 20 · Proceedings of the 4th Table Representation Learning Workshop
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of handling complex, real-world database schemas in NL2SQL, a problem relevant to AI practitioners. The advance is incremental because it builds on existing LLM methods.

The paper tackles the poor performance of large language models (LLMs) on Natural Language to SQL (NL2SQL) tasks that involve large database schemas. It introduces SQLong, a data augmentation framework that generates synthetic schemas and data to simulate long-context scenarios, and reports significant performance improvements on the Spider and BIRD datasets.

Open-weight large language models (LLMs) have significantly advanced performance in the Natural Language to SQL (NL2SQL) task. However, their effectiveness diminishes when dealing with large database schemas, because the context length increases. To address this limitation, we present SQLong, a novel and efficient data augmentation framework designed to enhance LLM performance in long-context scenarios for the NL2SQL task. SQLong generates augmented datasets by extending existing database schemas with additional synthetic CREATE TABLE commands and corresponding data rows, sampled from diverse schemas in the training data. This approach effectively simulates long-context scenarios during finetuning and evaluation. Through experiments on the Spider and BIRD datasets, we demonstrate that LLMs finetuned with SQLong-augmented data significantly outperform those trained on standard datasets. These results indicate that SQLong is practical to implement and improves NL2SQL capabilities in real-world settings with complex database schemas.
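The augmentation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the table dictionaries, the `render`, `count_tokens`, and `augment_schema` helpers, the whitespace token count, the name-collision check, and the final shuffle are all assumptions made for illustration. The idea shown is only the one the abstract states: pad a database schema with CREATE TABLE statements and rows sampled from other training schemas until a target context length is reached.

```python
import random


def render(table):
    # One prompt chunk: the CREATE TABLE statement, then sample rows as SQL comments.
    return "\n".join([table["ddl"]] + [f"-- {row}" for row in table["rows"]])


def count_tokens(text):
    # Crude whitespace proxy (assumption); a real setup would use the model's tokenizer.
    return len(text.split())


def augment_schema(original, pool, max_tokens, seed=0):
    """Pad `original` tables with distractor tables from `pool` within a token budget."""
    rng = random.Random(seed)
    tables = list(original)
    used_names = {t["name"].lower() for t in tables}
    budget = max_tokens - sum(count_tokens(render(t)) for t in tables)

    candidates = list(pool)
    rng.shuffle(candidates)
    for cand in candidates:
        cost = count_tokens(render(cand))
        # Skip tables that would clash with an existing name or exceed the budget.
        if cand["name"].lower() in used_names or cost > budget:
            continue
        tables.append(cand)
        used_names.add(cand["name"].lower())
        budget -= cost

    # Shuffle so the original tables are not always first (an assumption of this sketch).
    rng.shuffle(tables)
    return "\n\n".join(render(t) for t in tables)


# Hypothetical toy data: one target table and a pool drawn from other databases.
original = [
    {"name": "singer", "ddl": "CREATE TABLE singer (id INT, name TEXT);", "rows": ["1, 'Joe'"]},
]
pool = [
    {"name": "airport", "ddl": "CREATE TABLE airport (code TEXT, city TEXT);", "rows": ["'AAA', 'Anaa'"]},
    {"name": "singer", "ddl": "CREATE TABLE singer (sid INT);", "rows": []},
    {"name": "pets", "ddl": "CREATE TABLE pets (pet_id INT, kind TEXT);", "rows": ["7, 'cat'"]},
]

long_schema = augment_schema(original, pool, max_tokens=40)
short_schema = augment_schema(original, pool, max_tokens=20)
print(long_schema)
```

In this toy run the duplicate `singer` table from the pool is skipped, and the smaller budget leaves room for only one distractor table. Varying `max_tokens` is one way such a sketch could produce schemas of different lengths for finetuning and evaluation.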

