LLM NL2SQL Robustness: Surface Noise vs. Linguistic Variation in Traditional and Agentic Settings
This work addresses robustness challenges for NL2SQL systems in dynamic, noisy real-world environments. Its contribution is incremental: it evaluates existing models rather than proposing new methods.
The paper evaluates the robustness of Natural Language to SQL (NL2SQL) systems by introducing a benchmark with approximately ten types of perturbations and testing state-of-the-art LLMs, including Grok-4.1 and GPT-5.2, under both traditional and agentic settings. The models maintain strong performance under several perturbations but degrade notably under surface-level noise and linguistic variation. Surface noise causes larger drops in traditional settings, whereas linguistic variation causes larger drops in agentic ones.
Robustness evaluation for Natural Language to SQL (NL2SQL) systems is essential because real-world database environments are dynamic, noisy, and continuously evolving, whereas conventional benchmark evaluations typically assume static schemas and well-formed user inputs. In this work, we introduce a robustness evaluation benchmark containing approximately ten types of perturbations and conduct evaluations under both traditional and agentic settings. We assess multiple state-of-the-art large language models (LLMs), including Grok-4.1, Gemini-3-Pro, Claude-Opus-4.6, and GPT-5.2. Our results show that these models generally maintain strong performance under several perturbations; however, notable performance degradation is observed for surface-level noise (e.g., character-level corruption) and linguistic variation that preserves semantics while altering lexical or syntactic forms. Furthermore, we observe that surface-level noise causes larger performance drops in traditional pipelines, whereas linguistic variation presents greater challenges in agentic settings. These findings highlight the remaining challenges in achieving robust NL2SQL systems, particularly in handling linguistic variability.
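To make the notion of surface-level noise concrete, the following is a minimal sketch of one possible character-level corruption and of a simple accuracy-drop measure. It is not the benchmark's actual implementation: the abstract does not specify how perturbations are generated or how degradation is scored, so the function names `add_char_noise` and `accuracy_drop`, the adjacent-letter swap strategy, the `rate` and `seed` parameters, and the use of absolute accuracy difference are all assumptions made for illustration.

```python
import random


def add_char_noise(question: str, rate: float = 0.1, seed: int = 0) -> str:
    """Swap adjacent letters at roughly the given rate.

    Hypothetical perturbation: one of many ways to simulate typos.
    """
    rng = random.Random(seed)  # local generator keeps the result reproducible
    chars = list(question)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so each character moves at most once
        else:
            i += 1
    return "".join(chars)


def accuracy_drop(clean_correct: list[bool], perturbed_correct: list[bool]) -> float:
    """Absolute accuracy difference (clean minus perturbed) over paired examples.

    Assumed metric: each list holds one correctness flag per question.
    """
    if not clean_correct or len(clean_correct) != len(perturbed_correct):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(clean_correct)
    return sum(clean_correct) / n - sum(perturbed_correct) / n


if __name__ == "__main__":
    original = "List the names of all employees hired after 2020"
    print(add_char_noise(original, rate=0.2, seed=1))
    print(accuracy_drop([True, True, False, True], [True, False, False, False]))
```

In such a setup, the same model would be run on both the original and the perturbed question, and the per-question correctness flags (for example, from comparing execution results) would feed `accuracy_drop`. Swapping only letters leaves digits and punctuation intact, so literal values such as years are not altered; a real benchmark may well make different choices.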