A Learned Cost Model-based Cross-engine Optimizer for SQL Workloads
This addresses the manual and complex task of engine selection for database administrators, though it is incremental as it builds on existing cost modeling and optimization techniques.
The paper tackles the problem of selecting the best execution engine for SQL queries in lakehouse systems by proposing a cross-engine optimizer with a learned cost model, which reduces average Q-error by up to 12.6% and total workload runtime by up to 30.4% compared to baselines.
Lakehouse systems enable the same data to be queried with multiple execution engines. However, selecting the engine best suited to run a SQL query still requires a priori knowledge of the query computational requirements and an engine capability, a complex and manual task that only becomes more difficult with the emergence of new engines and workloads. In this paper, we address this limitation by proposing a cross-engine optimizer that can automate engine selection for diverse SQL queries through a learned cost model. Optimized with hints, a query plan is used for query cost prediction and routing. Cost prediction is formulated as a multi-task learning problem, and multiple predictor heads, corresponding to different engines and provisionings, are used in the model architecture. This eliminates the need to train engine-specific models and allows the flexible addition of new engines at a minimal fine-tuning cost. Results on various databases and engines show that using a query optimized logical plan for cost estimation decreases the average Q-error by even 12.6% over using unoptimized plans as input. Moreover, the proposed cross-engine optimizer reduces the total workload runtime by up to 25.2% in a zero-shot setting and 30.4% in a few-shot setting when compared to random routing.