Yannis Chronis

DB
h-index: 36
Papers: 4
Citations: 8
Novelty: 44%
AI Score: 43

4 Papers

33.2 DB Mar 24
An In-Depth Study of Filter-Agnostic Vector Search on a PostgreSQL Database System: [Experiments and Analysis]

Duo Lu, Helena Caminal, Manos Chatzakis et al.

Filtered Vector Search (FVS) is critical for supporting semantic search and GenAI applications in modern database systems. However, existing research most often evaluates algorithms in specialized libraries, making optimistic assumptions that do not align with enterprise-grade database systems. Our work challenges this premise by demonstrating that in a production-grade database system, commonly made assumptions do not hold, leading to performance characteristics and algorithmic trade-offs that are fundamentally different from those observed in isolated library settings. This paper presents the first in-depth analysis of filter-agnostic FVS algorithms within a production PostgreSQL-compatible system. We systematically evaluate post-filtering and inline-filtering strategies across a wide range of selectivities and correlations. Our central finding is that the optimal algorithm is not dictated by the cost of distance computations alone, but that system-level overheads that come from both distance computations and filter operations (like page accesses and data retrieval) play a significant role. We demonstrate that graph-based approaches (such as NaviX/ACORN) can incur prohibitive numbers of filter checks and system-level overheads, compared with clustering-based indexes such as ScaNN, often canceling out their theoretical benefits in real-world database environments. Ultimately, our findings provide the database community with crucial insights and practical guidelines, demonstrating that the optimal choice for a filter-agnostic FVS algorithm is not absolute, but rather a system-aware decision contingent on the interplay between workload characteristics and the underlying costs of data access in a real-world database architecture.
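
The difference between the post-filtering and inline-filtering strategies named in the abstract can be illustrated with a minimal sketch. It assumes a brute-force scan in place of the graph-based and clustering-based indexes the paper evaluates, so it shows only where filter checks and distance computations happen, not the paper's measured costs. The function names, the `overfetch` parameter, and the toy data are hypothetical.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def post_filter(vectors, passes, query, k, overfetch=4):
    """Search first, filter afterwards.

    Fetches the k * overfetch nearest rows ignoring the predicate, then
    drops rows that fail it. May return fewer than k rows when the
    predicate is selective.
    """
    n = len(vectors)
    candidates = heapq.nsmallest(
        min(k * overfetch, n), range(n),
        key=lambda i: squared_l2(vectors[i], query),
    )
    kept = [i for i in candidates if passes[i]]
    cost = {"filter_checks": len(candidates), "distance_computations": n}
    return kept[:k], cost


def inline_filter(vectors, passes, query, k):
    """Check the predicate during the scan; rank only rows that pass."""
    n = len(vectors)
    survivors = [i for i in range(n) if passes[i]]
    top = heapq.nsmallest(
        k, survivors, key=lambda i: squared_l2(vectors[i], query)
    )
    cost = {"filter_checks": n, "distance_computations": len(survivors)}
    return top, cost


# Toy data (hypothetical): 100 one-dimensional vectors, 10% pass the filter.
vectors = [[float(i)] for i in range(100)]
passes = [i % 10 == 0 for i in range(100)]
query = [0.0]

post_ids, post_cost = post_filter(vectors, passes, query, k=3)
inline_ids, inline_cost = inline_filter(vectors, passes, query, k=3)
print(post_ids, post_cost)
print(inline_ids, inline_cost)
```

On this toy input, post-filtering returns only two of the three requested rows because most of its over-fetched candidates fail the predicate, while inline filtering returns all three at the price of a filter check on every row. In a real database each of those checks may involve a page access, which is the kind of system-level overhead the abstract refers to.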

DB Aug 28, 2024
CardBench: A Benchmark for Learned Cardinality Estimation in Relational Databases

Yannis Chronis, Yawen Wang, Yu Gan et al.

Cardinality estimation is crucial for enabling high query performance in relational databases. Recently, learned cardinality estimation models have been proposed to improve accuracy, but there is no systematic benchmark or dataset that allows researchers to evaluate the progress made by new learned approaches, or to develop new learned approaches systematically. In this paper, we release a benchmark containing thousands of queries over 20 distinct real-world databases for learned cardinality estimation. In contrast to other initial benchmarks, our benchmark is much more diverse and can be used for training and testing learned models systematically. Using this benchmark, we explored whether learned cardinality estimation can be transferred to an unseen dataset in a zero-shot manner. We trained GNN-based and transformer-based models to study the problem in three setups: (1) instance-based, (2) zero-shot, and (3) fine-tuned. Our results show that zero-shot cardinality estimation is promising on simple single-table queries, but accuracy drops as soon as we add joins. However, we show that with fine-tuning, we can still use pre-trained models for cardinality estimation, significantly reducing training overhead compared to instance-specific models. We are open-sourcing our scripts to collect statistics and to generate queries and training datasets, in order to foster more extensive research, including from the ML community, on the important problem of cardinality estimation, and in particular to improve on recent directions such as pre-trained cardinality estimation.

44.8 DB May 15
To GPU or Not to GPU: Vector Search in Relational Engines

Vasilis Mageirakos, Joel André, Marko Kabić et al.

Vector search (VS) is now available in most database engines. However, while vector search is a common feature in AI/ML/LLMs where the dominant computing platforms are GPUs, existing database engines operate on CPUs even when implementing vector search. This raises the question of whether integrating vector processing on GPUs as part of the engine would be a better design. In this paper, we explore this question in detail. First, we extend the TPC-H benchmark with vector data (from text and images) and propose a number of representative SQL+VS queries. Second, we develop a modular execution engine that can run SQL+VS queries across CPU and GPU. Third, we perform extensive experiments on a number of deployments: running the SQL+VS queries across CPU and/or GPU, with data residing in CPU or GPU memory, with existing indices and novel, optimized versions, as well as across different GPUs and interconnects (PCIe, NVLink). The results provide actionable and counter-intuitive insights on how to run such queries over CPUs and GPUs. For instance, the relational components benefit much more from running on the GPU than the vector search part. In addition, when the vector search involves moving data and indexes, using the GPU is not the best option, even with fast interconnects. Thus, we develop an alternative organization of vector index and embeddings that reduces the size of the index, making GPU-based vector search more competitive. With these improvements, the final result is that both the relational and vector search components are faster on the GPU, particularly on fast interconnects, in contrast with the architecture used in existing engines.

DB Oct 3, 2025
Is it Bigger than a Breadbox: Efficient Cardinality Estimation for Real World Workloads

Zixuan Yi, Sami Abu-el-Haija, Yawen Wang et al.

DB engines produce efficient query execution plans by relying on cost models. Practical implementations estimate the cardinality of queries using heuristics, with magic numbers tuned to improve average performance on benchmarks. Empirically, estimation error grows significantly with query complexity. Alternatively, learning-based estimators offer improved accuracy, but add operational complexity that prevents their adoption in practice. Recognizing that query workloads contain highly repetitive subquery patterns, we learn many simple regressors online, each localized to a pattern. The regressor corresponding to a pattern can be randomly accessed using a hash of the graph structure of the subquery. Our method has negligible overhead and competes with SoTA learning-based approaches on error metrics. Further, amending PostgreSQL with our method achieves notable accuracy and runtime improvements over traditional methods and drastically reduces operational costs compared to other learned cardinality estimators, thereby offering the most practical and efficient solution on the Pareto frontier. Concretely, simulating the JOB-lite workload on IMDb speeds up execution by 7.5 minutes (>30%) while incurring only 37 seconds of overhead for online learning.
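
The idea of many simple per-pattern regressors looked up by a hash of the subquery graph can be sketched roughly as follows. This is an illustrative assumption, not the authors' implementation: the canonical encoding in `pattern_key`, the choice of a one-dimensional least-squares line in log space, and the use of the engine's heuristic estimate as the single feature are all hypothetical.

```python
import hashlib
import math


def pattern_key(tables, joins):
    """Hash a canonical form of the subquery's join graph.

    Hypothetical encoding: sorted table names plus sorted, orientation-free
    join edges, so equivalent patterns map to the same key.
    """
    canon_joins = sorted(tuple(sorted(edge)) for edge in joins)
    canon = repr((sorted(tables), canon_joins))
    return hashlib.sha1(canon.encode()).hexdigest()


class OnlineLine:
    """Least-squares fit of y = a + b * x, updated from running sums."""

    def __init__(self):
        self.n = self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, x, y):
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def predict(self, x):
        denom = self.n * self.sxx - self.sx ** 2
        if self.n < 2 or abs(denom) < 1e-12:
            return x  # too little data: fall back to the input estimate
        b = (self.n * self.sxy - self.sx * self.sy) / denom
        a = (self.sy - b * self.sx) / self.n
        return a + b * x


class PatternEstimator:
    """One small regressor per subquery pattern, found by hash lookup."""

    def __init__(self):
        self.models = {}

    def estimate(self, tables, joins, heuristic_rows):
        model = self.models.get(pattern_key(tables, joins))
        if model is None:
            return float(heuristic_rows)  # unseen pattern: keep heuristic
        return math.exp(model.predict(math.log(heuristic_rows)))

    def observe(self, tables, joins, heuristic_rows, true_rows):
        """Learn online from the cardinality seen after execution."""
        key = pattern_key(tables, joins)
        model = self.models.setdefault(key, OnlineLine())
        model.update(math.log(heuristic_rows), math.log(true_rows))


est = PatternEstimator()
est.observe(["a", "b"], [("a", "b")], heuristic_rows=10, true_rows=100)
est.observe(["b", "a"], [("b", "a")], heuristic_rows=100, true_rows=1000)
corrected = est.estimate(["a", "b"], [("a", "b")], heuristic_rows=1000)
print(round(corrected))
```

In this toy run the heuristic underestimates the pattern by a factor of ten in both observations, so the regressor for that pattern learns the correction and applies it to the third query, while unseen patterns keep the heuristic estimate unchanged.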