DBApr 9

SynQL: A Controllable and Scalable Rule-Based Framework for SQL Workload Synthesis for Performance Benchmarking

arXiv:2604.080214.2
Predicted impact top 87% in DB · last 90 daysOriginality Incremental advance
AI Analysis

This addresses a critical data gap for database researchers and developers of learned query optimizers, offering a scalable and controllable alternative to existing methods that often fail to produce valid or diverse queries.

The paper tackles the problem of generating realistic SQL workloads for training learned query optimizers when real-world data is unavailable due to privacy constraints, by proposing SynQL, a rule-based framework that produces structurally diverse and execution-ready SQL queries, achieving near-maximal topological diversity (Topological Entropy H = 1.53 bits) and enabling cost models with R² ≥ 0.79 on synthetic test sets.

Database research and the development of learned query optimisers rely heavily on realistic SQL workloads. Acquiring real-world queries is increasingly difficult, however, due to strict privacy regulations, and publicly released anonymised traces typically strip out executable query text to preserve confidentiality. Existing synthesis tools fail to bridge this training data gap: traditional benchmarks offer too few fixed templates for statistical generalisation, while Large Language Model (LLM) approaches suffer from schema hallucination fabricating non-existent columns and topological collapse systematically defaulting to simplistic join patterns that fail to stress-test query optimisers. We propose SynQL, a deterministic workload synthesis framework that generates structurally diverse, execution-ready SQL workloads. As a foundational step toward bridging the training-data gap, SynQL targets the core SQL fragment -- multi-table joins with projections, aggregations, and range predicates -- which dominates analytical workloads. SynQL abandons probabilistic text generation in favour of traversing the live database's foreign-key graph to populate an Abstract Syntax Tree (AST), guaranteeing schema and syntactic validity by construction. A configuration vector $Θ$ provides explicit, parametric control over join topology (Star, Chain, Fork), analytical intensity, and predicate selectivity. Experiments on TPC-H and IMDb show that SynQL produces near-maximally diverse workloads (Topological Entropy $H = 1.53$ bits) and that tree-based cost models trained on the synthetic corpus achieve $R^2 \ge 0.79$ on held-out synthetic test sets with sub-millisecond inference latency, establishing SynQL as an effective foundation for generating training data when production logs are inaccessible.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes