Asterios Katsifodimos

DB
h-index100
6papers
154citations
Novelty38%
AI Score41

6 Papers

35.0DBMay 28
The Missing Dimensions in Geo-Distributed Database Evaluation

Oto Mraz, Kyriakos Psarakis, George Christodoulou et al.

Geo-distributed OLTP databases are widely deployed across cloud regions, yet current evaluation practices do not cover the challenges of this aspect. Existing benchmarks assume stable network conditions; they lack explicit settings for data and client locality, and they largely ignore data transfer costs across regions. In addition, most evaluations rely on a limited set of geo-distribution patterns. In this paper, we propose Gaia, a comprehensive evaluation framework that addresses these gaps. We use Gaia to perform a comprehensive evaluation of existing geo-distributed OLTP systems. We deploy them across multiple cloud regions, using different geo-distribution patterns and variable cross-region network conditions. Among other interesting findings, our framework reveals that: i) most systems are sensitive to network instabilities, ii) network costs dominate cloud deployment expenses iii) multi-region fault-tolerance mechanisms incur measurable critical-path overhead that is often overlooked in prior evaluations. We argue that for the design of future geo-distributed databases, we must rethink the trade-offs between performance, fault-tolerance, and cost.

LGJul 19, 2022
Metadata Representations for Queryable ML Model Zoos

Ziyu Li, Rihan Hai, Alessandro Bozzon et al.

Machine learning (ML) practitioners and organizations are building model zoos of pre-trained models, containing metadata describing properties of the ML models and datasets that are useful for reporting, auditing, reproducibility, and interpretability purposes. The metatada is currently not standardised; its expressivity is limited; and there is no interoperable way to store and query it. Consequently, model search, reuse, comparison, and composition are hindered. In this paper, we advocate for standardized ML model meta-data representation and management, proposing a toolkit supported to help practitioners manage and query that metadata.

IRSep 2, 2024
LLM-PQA: LLM-enhanced Prediction Query Answering

Ziyu Li, Wenjie Zhao, Asterios Katsifodimos et al.

The advent of Large Language Models (LLMs) provides an opportunity to change the way queries are processed, moving beyond the constraints of conventional SQL-based database systems. However, using an LLM to answer a prediction query is still challenging, since an external ML model has to be employed and inference has to be performed in order to provide an answer. This paper introduces LLM-PQA, a novel tool that addresses prediction queries formulated in natural language. LLM-PQA is the first to combine the capabilities of LLMs and retrieval-augmented mechanism for the needs of prediction queries by integrating data lakes and model zoos. This integration provides users with access to a vast spectrum of heterogeneous data and diverse ML models, facilitating dynamic prediction query answering. In addition, LLM-PQA can dynamically train models on demand, based on specific query requirements, ensuring reliable and relevant results even when no pre-trained model in a model zoo, available for the task.

DCAug 3, 2020Code
A Survey on the Evolution of Stream Processing Systems

Marios Fragkoulis, Paris Carbone, Vasiliki Kalavri et al.

Stream processing has been an active research field for more than 20 years, but it is now witnessing its prime time due to recent successful efforts by the research community and numerous worldwide open-source communities. This survey provides a comprehensive overview of fundamental aspects of stream processing systems and their evolution in the functional areas of out-of-order data management, state management, fault tolerance, high availability, load management, elasticity, and reconfiguration. We review noteworthy past research findings, outline the similarities and differences between early ('00-'10) and modern ('11-'22) streaming systems, and discuss recent trends and open problems.

DBFeb 1, 2025
CoddLLM: Empowering Large Language Models for Data Analytics

Jiani Zhang, Hengrui Zhang, Rishav Chakravarti et al. · amazon-science, cmu

Large Language Models (LLMs) have the potential to revolutionize data analytics by simplifying tasks such as data discovery and SQL query synthesis through natural language interactions. This work serves as a pivotal first step toward the development of foundation models explicitly designed for data analytics applications. To propel this vision forward, we unveil a new data recipe for post-training LLMs, enhancing their comprehension of data management and empowering them to tackle complex real-world analytics tasks. Specifically, our innovative approach includes a scalable synthetic data generation method that enables the creation of a broad spectrum of topics centered on data representation and manipulation. Furthermore, we introduce two new tasks that seamlessly bridge tables and text. We show that such tasks can enhance models' understanding of schema creation and the nuanced translation between natural language and tabular data. Leveraging this data recipe, we post-train a new foundation model, named CoddLLM, based on Mistral-NeMo-12B. To assess the language understanding and reasoning capabilities of LLMs in the realm of data analytics, we contribute AnalyticsMMLU, a benchmark containing thousands of multiple-choice questions on databases, data analysis, and machine learning. Our focus on data discovery, has resulted in the contribution of three comprehensive benchmarks that address both database and data lake scenarios. CoddLLM not only excels in performance but also sets a new standard, achieving the highest average accuracy across eight datasets. It outperforms GPT-3.5-Turbo on AnalyticsMMLU, exceeding GPT-4o by 12.1% in table selection and showing an average improvement of 24.9% in Text-to-SQL compared to the base model.

DBJun 13, 2024
FeatNavigator: Automatic Feature Augmentation on Tabular Data

Jiaming Liang, Chuan Lei, Xiao Qin et al.

Data-centric AI focuses on understanding and utilizing high-quality, relevant data in training machine learning (ML) models, thereby increasing the likelihood of producing accurate and useful results. Automatic feature augmentation, aiming to augment the initial base table with useful features from other tables, is critical in data preparation as it improves model performance, robustness, and generalizability. While recent works have investigated automatic feature augmentation, most of them have limited capabilities in utilizing all useful features as many of them are in candidate tables not directly joinable with the base table. Worse yet, with numerous join paths leading to these distant features, existing solutions fail to fully exploit them within a reasonable compute budget. We present FeatNavigator, an effective and efficient framework that explores and integrates high-quality features in relational tables for ML models. FeatNavigator evaluates a feature from two aspects: (1) the intrinsic value of a feature towards an ML task (i.e., feature importance) and (2) the efficacy of a join path connecting the feature to the base table (i.e., integration quality). FeatNavigator strategically selects a small set of available features and their corresponding join paths to train a feature importance estimation model and an integration quality prediction model. Furthermore, FeatNavigator's search algorithm exploits both estimated feature importance and integration quality to identify the optimized feature augmentation plan. Our experimental results show that FeatNavigator outperforms state-of-the-art solutions on five public datasets by up to 40.1% in ML model performance.