AI · Apr 30

Agent-Agnostic Evaluation of SQL Accuracy in Production Text-to-SQL Systems

arXiv:2604.28049

AI Analysis

For practitioners deploying T2SQL agents in production, STEF solves the problem of evaluating SQL accuracy without schema or reference queries, enabling quality monitoring and improvement loops.

The paper introduces STEF, a schema-agnostic evaluation framework for production Text-to-SQL systems that operates without ground-truth queries or database schema, enabling continuous monitoring and feedback loops. Empirical results show it makes structured query evaluation viable at scale.

Text-to-SQL (T2SQL) evaluation in production environments poses fundamental challenges that existing benchmarks do not address. Current evaluation methodologies, whether rule-based SQL matching or schema-dependent semantic parsers, assume access to ground-truth queries and a structured database schema, constraints that are rarely satisfied in real-world deployments. This disconnect leaves production T2SQL agents largely unevaluated beyond development-time testing, so quality can degrade silently with no feedback mechanism for continuous improvement. We present STEF (Schema-agnostic Text-to-SQL Evaluation Framework), a production-native evaluation system that operates exclusively on natural language inputs (the user question, an enriched reformulation, and the generated SQL) without requiring a database schema or reference queries. STEF extracts semantic specifications from both the natural language and SQL representations, performs normalized feature alignment, and produces an interpretable 0 to 100 accuracy score via a composite metric that combines filter alignment, semantic verdict, and evaluator confidence. Key contributions include: enriched question quality validation as a first-class evaluation signal, configurable application-specific rule injection via prompt templating, and production-robust normalization handling GROUP BY tolerance, ORDER BY defaults, and LIMIT heuristics. Empirical results demonstrate that STEF enables continuous production monitoring and agent improvement feedback loops without schema dependency, making structured query evaluation viable at scale for the first time.
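The abstract names the three components of the composite metric but not how they are combined. As a rough illustration only, the sketch below assumes a normalized weighted sum scaled to 0 to 100. The names `EvalSignals`, `filter_alignment`, and `composite_score`, the set-overlap definition of filter alignment, and the 0.4/0.4/0.2 weights are all hypothetical and are not taken from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSignals:
    """Hypothetical container for the three signals named in the abstract."""
    filter_alignment: float  # share of question filters found in the SQL, in [0, 1]
    semantic_verdict: float  # evaluator's judgment of intent match, in [0, 1]
    confidence: float        # evaluator's self-reported confidence, in [0, 1]


def filter_alignment(question_filters: set[str], sql_filters: set[str]) -> float:
    """Assumed definition: fraction of question filters also present in the SQL."""
    if not question_filters:
        return 1.0  # nothing was required, so nothing can be missing
    return len(question_filters & sql_filters) / len(question_filters)


def composite_score(
    signals: EvalSignals,
    w_filter: float = 0.4,   # illustrative weight, not from the paper
    w_verdict: float = 0.4,  # illustrative weight, not from the paper
    w_conf: float = 0.2,     # illustrative weight, not from the paper
) -> float:
    """Combine the three signals into a 0 to 100 score (assumed weighted sum)."""
    values = (signals.filter_alignment, signals.semantic_verdict, signals.confidence)
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("all signals must lie in [0, 1]")
    total = w_filter + w_verdict + w_conf
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    raw = (
        w_filter * signals.filter_alignment
        + w_verdict * signals.semantic_verdict
        + w_conf * signals.confidence
    ) / total
    return round(100.0 * raw, 1)


# Filters are assumed to be already normalized strings extracted upstream.
alignment = filter_alignment(
    {"region = emea", "year = 2024"},  # filters implied by the question
    {"region = emea"},                 # filters found in the generated SQL
)
score = composite_score(
    EvalSignals(filter_alignment=alignment, semantic_verdict=1.0, confidence=0.8)
)
print(alignment, score)
```

A weighted sum is only one plausible reading of "composite metric"; it keeps each component's contribution to the final score easy to inspect, which fits the stated goal of interpretability.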
