Amr El Abbadi

DB
h-index74
9papers
180citations
Novelty51%
AI Score36

9 Papers

DBAug 1, 2024
Hybrid Querying Over Relational Databases and Large Language Models

Fuheng Zhao, Divyakant Agrawal, Amr El Abbadi

Database queries traditionally operate under the closed-world assumption, providing no answers to questions that require information beyond the data stored in the database. Hybrid querying using SQL offers an alternative by integrating relational databases with large language models (LLMs) to answer beyond-database questions. In this paper, we present the first cross-domain benchmark, SWAN, containing 120 beyond-database questions over four real-world databases. To leverage state-of-the-art language models in addressing these complex questions in SWAN, we present two solutions: one based on schema expansion and the other based on user defined functions. We also discuss optimization opportunities and potential future directions. Our evaluation demonstrates that using GPT-4 Turbo with few-shot prompts, one can achieves up to 40.0\% in execution accuracy and 48.2\% in data factuality. These results highlights both the potential and challenges for hybrid querying. We believe that our work will inspire further research in creating more efficient and accurate data systems that seamlessly integrate relational databases and large language models to address beyond-database questions.

DBDec 16, 2023
LLM-SQL-Solver: Can LLMs Determine SQL Equivalence?

Fuheng Zhao, Jiayue Chen, Lawrence Lim et al.

Judging the equivalence between two SQL queries is a fundamental problem with many practical applications in data management and SQL generation (i.e., evaluating the quality of generated SQL queries in text-to-SQL task). While the research community has reasoned about SQL equivalence for decades, it poses considerable difficulties and no complete solutions exist. Recently, Large Language Models (LLMs) have shown strong reasoning capability in conversation, question answering and solving mathematics challenges. In this paper, we study if LLMs can be used to determine the equivalence between SQL queries under two notions of SQL equivalence (semantic equivalence and relaxed equivalence). To assist LLMs in generating high quality responses, we present two prompting techniques: Miniature & Mull and Explain & Compare. The former technique is used to evaluate the semantic equivalence in which it asks LLMs to execute a query on a simple database instance and then explore if a counterexample exists by modifying the database. The latter technique is used to evaluate the relaxed equivalence in which it asks LLMs to explain the queries and then compare if they contain significant logical differences. Our experiments demonstrate using our techniques, LLMs is a promising tool to help data engineers in writing semantically equivalent SQL queries, however challenges still persist, and is a better metric for evaluating SQL generation than the popular execution accuracy.

DBAug 30, 2025
Access Paths for Efficient Ordering with Large Language Models

Fuheng Zhao, Jiayue Chen, Yiming Pan et al.

We present the LLM ORDER BY operator as a logical abstraction and study its physical implementations within a unified evaluation framework. Our experiments show that no single approach is universally optimal, with effectiveness depending on query characteristics and data. We introduce three new designs: an agreement-based batch-size policy, a majority voting mechanism for pairwise sorting, and a two-way external merge sort adapted for LLMs. With extensive experiments, our agreement-based procedure is effective at determining batch size for value-based methods, the majority-voting mechanism consistently strengthens pairwise comparisons on GPT-4o, and external merge sort achieves high accuracy-efficiency trade-offs across datasets and models. We further observe a log-linear scaling between compute cost and ordering quality, offering the first step toward principled cost models for LLM powered data systems.

DBMay 3, 2020
SEPAR: Towards Regulating Future of Work Multi-Platform Crowdworking Environments with Privacy Guarantees

Mohammad Javad Amiri, Joris Duguépéroux, Tristan Allard et al.

Crowdworking platforms provide the opportunity for diverse workers to execute tasks for different requesters. The popularity of the "gig" economy has given rise to independent platforms that provide competing and complementary services. Workers as well as requesters with specific tasks may need to work for or avail from the services of multiple platforms resulting in the rise of multi-platform crowdworking systems. Recently, there has been increasing interest by governmental, legal and social institutions to enforce regulations, such as minimal and maximal work hours, on crowdworking platforms. Platforms within multi-platform crowdworking systems, therefore, need to collaborate to enforce cross-platform regulations. While collaborating to enforce global regulations requires the transparent sharing of information about tasks and their participants, the privacy of all participants needs to be preserved. In this paper, we propose an overall vision exploring the regulation, privacy, and architecture dimensions for the future of work multi-platform crowdworking environments. We then present SEPAR, a multi-platform crowdworking system that enforces a large sub-space of practical global regulations on a set of distributed independent platforms in a privacy-preserving manner. SEPAR, enforces privacy using lightweight and anonymous tokens, while transparency is achieved using fault-tolerant blockchains shared across multiple platforms. The privacy guarantees of SEPAR against covert adversaries are formalized and thoroughly demonstrated, while the experiments reveal the efficiency of SEPAR in terms of performance and scalability.

DBJan 20, 2020
Fides: Managing Data on Untrusted Infrastructure

Sujaya Maiyya, Danny Hyun Bum Cho, Divyakant Agrawal et al.

Significant amounts of data are currently being stored and managed on third-party servers. It is impractical for many small scale enterprises to own their private datacenters, hence renting third-party servers is a viable solution for such businesses. But the increasing number of malicious attacks, both internal and external, as well as buggy software on third-party servers is causing clients to lose their trust in these external infrastructures. While small enterprises cannot avoid using external infrastructures, they need the right set of protocols to manage their data on untrusted infrastructures. In this paper, we propose TFCommit, a novel atomic commitment protocol that executes transactions on data stored across multiple untrusted servers. To our knowledge, TFCommit is the first atomic commitment protocol to execute transactions in an untrusted environment without using expensive Byzantine replication. Using TFCommit, we propose an auditable data management system, Fides, residing completely on untrustworthy infrastructure. As an auditable system, Fides guarantees the detection of potentially malicious failures occurring on untrusted servers using tamper-resistant logs with the support of cryptographic techniques. The experimental evaluation demonstrates the scalability and the relatively low overhead of our approach that allows executing transactions on untrusted infrastructure.

SIMay 23, 2019
Multifaceted Privacy: How to Express Your Online Persona without Revealing Your Sensitive Attributes

Victor Zakhary, Ishani Gupta, Rey Tang et al.

Recent works in social network stream analysis show that a user's online persona attributes (e.g., gender, ethnicity, political interest, location, etc.) can be accurately inferred from the topics the user writes about or engages with. Attribute and preference inferences have been widely used to serve personalized recommendations, directed ads, and to enhance the user experience in social networks. However, revealing a user's sensitive attributes could represent a privacy threat to some individuals. Microtargeting (e.g.,Cambridge Analytica scandal), surveillance, and discriminating ads are examples of threats to user privacy caused by sensitive attribute inference. In this paper, we propose Multifaceted privacy, a novel privacy model that aims to obfuscate a user's sensitive attributes while publicly preserving the user's public persona. To achieve multifaceted privacy, we build Aegis, a prototype client-centric social network stream processing system that helps preserve multifaceted privacy, and thus allowing social network users to freely express their online personas without revealing their sensitive attributes of choice. Aegis allows social network users to control which persona attributes should be publicly revealed and which ones should be kept private. For this, Aegis continuously suggests topics and hashtags to social network users to post in order to obfuscate their sensitive attributes and hence confuse content-based sensitive attribute inferences. The suggested topics are carefully chosen to preserve the user's publicly revealed persona attributes while hiding their private sensitive persona attributes. Our experiments show that adding as few as 0 to 4 obfuscation posts (depending on how revealing the original post is) successfully hides the user specified sensitive attributes without changing the user's public persona attributes.

DBMay 22, 2019
Towards Global Asset Management in Blockchain Systems

Victor Zakhary, Mohammad Javad Amiri, Sujaya Maiyya et al.

Permissionless blockchains (e.g., Bitcoin, Ethereum, etc) have shown a wide success in implementing global scale peer-to-peer cryptocurrency systems. In such blockchains, new currency units are generated through the mining process and are used in addition to transaction fees to incentivize miners to maintain the blockchain. Although it is clear how currency units are generated and transacted on, it is unclear how to use the infrastructure of permissionless blockchains to manage other assets than the blockchain's currency units (e.g., cars, houses, etc). In this paper, we propose a global asset management system by unifying permissioned and permissionless blockchains. A governmental permissioned blockchain authenticates the registration of end-user assets through smart contract deployments on a permissionless blockchain. Afterwards, end-users can transact on their assets through smart contract function calls (e.g., sell a car, rent a room in a house, etc). In return, end-users get paid in currency units of the same blockchain or other blockchains through atomic cross-chain transactions and governmental offices receive taxes on these transactions in cryptocurrency units.

DBMay 8, 2019
Atomic Commitment Across Blockchains

Victor Zakhary, Divyakant Agrawal, Amr El Abbadi

The recent adoption of blockchain technologies and open permissionless networks suggest the importance of peer-to-peer atomic cross-chain transaction protocols. Users should be able to atomically exchange tokens and assets without depending on centralized intermediaries such as exchanges. Recent peer-to-peer atomic cross-chain swap protocols use hashlocks and timelocks to ensure that participants comply to the protocol. However, an expired timelock could lead to a violation of the all-or-nothing atomicity property. An honest participant who fails to execute a smart contract on time due to a crash failure or network delays at her site might end up losing her assets. Although a crashed participant is the only participant who ends up worse off, current proposals are unsuitable for atomic cross-chain transactions in asynchronous environments where crash failures and network delays are the norm. In this paper, we present AC3WN, the first decentralized all-or-nothing atomic cross-chain commitment protocol. The redeem and refund events of the smart contracts that exchange assets are modeled as conflicting events. An open permissionless network of witnesses is used to guarantee that conflicting events could never simultaneously occur and either all smart contracts in an atomic cross-chain transaction are redeemed or all of them are refunded.

IRJul 16, 2018
A Distributed Collaborative Filtering Algorithm Using Multiple Data Sources

Mohamed Reda Bouadjenek, Esther Pacitti, Maximilien Servajean et al.

Collaborative Filtering (CF) is one of the most commonly used recommendation methods. CF consists in predicting whether, or how much, a user will like (or dislike) an item by leveraging the knowledge of the user's preferences as well as that of other users. In practice, users interact and express their opinion on only a small subset of items, which makes the corresponding user-item rating matrix very sparse. Such data sparsity yields two main problems for recommender systems: (1) the lack of data to effectively model users' preferences, and (2) the lack of data to effectively model item characteristics. However, there are often many other data sources that are available to a recommender system provider, which can describe user interests and item characteristics (e.g., users' social network, tags associated to items, etc.). These valuable data sources may supply useful information to enhance a recommendation system in modeling users' preferences and item characteristics more accurately and thus, hopefully, to make recommenders more precise. For various reasons, these data sources may be managed by clusters of different data centers, thus requiring the development of distributed solutions. In this paper, we propose a new distributed collaborative filtering algorithm, which exploits and combines multiple and diverse data sources to improve recommendation quality. Our experimental evaluation using real datasets shows the effectiveness of our algorithm compared to state-of-the-art recommendation algorithms.