CL, DB | Apr 8

SQLStructEval: Structural Evaluation of LLM Text-to-SQL Generation

Yixi Zhou, Fan Zhang, Zhiqiao Guo, Yu Chen, Haipeng Zhang, Preslav Nakov, Zhuohan Xie

arXiv:2604.06736 | 95.8 | h-index: 48 | Has Code

AI Analysis

The work addresses structural evaluation, an overlooked dimension of LLM-based program generation systems; the contribution is incremental but matters for reliability.

The paper tackles the problem of structural reliability in LLM-generated SQL queries, showing that modern LLMs often produce structurally diverse queries even when execution results are correct, and that a compile-style pipeline can improve both execution accuracy and structural consistency.

Despite strong performance on Text-to-SQL benchmarks, it remains unclear whether LLM-generated SQL programs are structurally reliable. In this work, we investigate the structural behavior of LLM-generated SQL queries and introduce SQLStructEval, a framework for analyzing program structures through canonical abstract syntax tree (AST) representations. Our experiments on the Spider benchmark show that modern LLMs often produce structurally diverse queries for the same input, even when execution results are correct, and that such variance is frequently triggered by surface-level input changes such as paraphrases or schema presentation. We further show that generating queries in a structured space via a compile-style pipeline can improve both execution accuracy and structural consistency. These findings suggest that structural reliability is a critical yet overlooked dimension for evaluating LLM-based program generation systems. Our code is available at https://anonymous.4open.science/r/StructEval-2435.
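The idea of comparing queries through canonical AST representations can be illustrated with a minimal sketch. This is not the authors' implementation: the tuple-based AST, the `canonicalize` and `structural_consistency` helpers, the set of commutative operators, and the consistency score (share of samples matching the most common canonical form) are all assumptions made for illustration. A real pipeline would obtain the AST from a SQL parser rather than writing it by hand.

```python
from collections import Counter

# Hypothetical toy AST: a node is a string leaf or a tuple (operator, child, ...).
# Operators whose operand order does not change the meaning (an assumption
# chosen for this sketch, not taken from the paper).
COMMUTATIVE = {"and", "or", "="}


def canonicalize(node):
    """Return a canonical form: lowercase operators, sorted commutative operands."""
    if isinstance(node, str):
        return node  # leaves (identifiers, literals) are left untouched
    op = node[0].lower()
    children = [canonicalize(child) for child in node[1:]]
    if op in COMMUTATIVE:
        # repr gives a total order over mixed string/tuple children
        children.sort(key=repr)
    return (op, *children)


def structural_consistency(asts):
    """Fraction of sampled ASTs that share the most common canonical form."""
    forms = Counter(canonicalize(ast) for ast in asts)
    return forms.most_common(1)[0][1] / len(asts)


# Three hand-written samples for one question: q1 and q2 differ only in
# operand order; q3 uses a different structure (a subquery).
q1 = ("SELECT", ("cols", "name"), ("from", "singer"),
      ("where", ("AND", (">", "age", "30"), ("=", "country", "'France'"))))
q2 = ("select", ("cols", "name"), ("from", "singer"),
      ("where", ("and", ("=", "'France'", "country"), (">", "age", "30"))))
q3 = ("select", ("cols", "name"), ("from", "singer"),
      ("where", ("in", "singer_id",
                 ("select", ("cols", "singer_id"), ("from", "french_singers")))))

same_structure = canonicalize(q1) == canonicalize(q2)
score = structural_consistency([q1, q2, q3])
```

Here `q1` and `q2` collapse to the same canonical form, while `q3` stays distinct even if it were to return the same rows, which is the kind of variance that execution accuracy alone does not capture.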
