Georgios Kellaris

4 papers

9 citations

Novelty 43%

AI Score 40

Ranked #99,895 of 201,326 authors (top 50%); #2,551 in CR (top 35%)

4 Papers

21.7 CR Apr 21

Sherpa.ai Privacy-Preserving Multi-Party Entity Alignment without Intersection Disclosure for Noisy Identifiers

Daniel M. Jimenez-Gutierrez, Enrique Zuazua, Georgios Kellaris et al.

Federated Learning (FL) enables collaborative model training among multiple parties without centralizing raw data. There are two main paradigms in FL: Horizontal FL (HFL), where all participants share the same feature space but hold different samples, and Vertical FL (VFL), where parties possess complementary features for the same set of samples. A prerequisite for VFL training is privacy-preserving entity alignment (PPEA), which establishes a common index of samples across parties (alignment) without revealing which samples are shared between them. Conventional private set intersection (PSI) achieves alignment but leaks intersection membership, exposing sensitive relationships between datasets. The standard private set union (PSU) mitigates this risk by aligning on the union of identifiers rather than the intersection. However, existing approaches are often limited to two parties or lack support for typo-tolerant matching. In this paper, we introduce the Sherpa.ai multi-party PSU protocol for VFL, a PPEA method that hides intersection membership and enables both exact and noisy matching. The protocol generalizes two-party approaches to multiple parties with low communication overhead and offers two variants: an order-preserving version for exact alignment and an unordered version tolerant to typographical and formatting discrepancies. We prove correctness and privacy, analyze communication and computational (exponentiation) complexity, and formalize a universal index mapping from local records to a shared index space. This multi-party PSU offers a scalable, mathematically grounded protocol for PPEA in real-world VFL deployments, such as multi-institutional healthcare disease detection, collaborative risk modeling between banks and insurers, and cross-domain fraud detection between telecommunications and financial institutions, while preserving intersection privacy.

24.6 LG May 13

Towards the Next Frontier of LLMs, Training on Private Data: A Cross-Domain Benchmark for Federated Fine-Tuning

Daniel M. Jimenez-Gutierrez, Enrique Zuazua, Georgios Kellaris et al.

The recent success of large language models (LLMs) has been largely driven by vast public datasets. However, the next frontier for LLM development lies beyond public data. Much of the world's most valuable information is private, especially in highly regulated sectors such as healthcare and finance, where data include patient histories or customer communications. Unlocking this data could represent a major leap forward, enabling LLMs with deeper domain expertise and stronger real-world utility. Yet, these data cannot be shared because they are distributed across institutions and constrained by privacy, regulatory, and organizational barriers. Moreover, institutional datasets are typically non-independent and identically distributed (non-IID), differing across sites in population characteristics, data modalities, documentation patterns, and task-specific label distributions. In this paper, we demonstrate a practical approach to unlocking private and distributed institutional data for LLM adaptation through federated collaboration across data silos. Built on the Sherpa.ai Federated Learning platform, our framework enables nodes to jointly fine-tune a shared LLM without exchanging private data. We evaluate this approach through a cross-domain benchmark in healthcare and finance, using four closed-ended question answering and classification datasets: MedQA, MedMCQA, FPB, and FiQA-SA. We compare three parameter-efficient fine-tuning (PEFT) strategies (LoRA, QLoRA, and IA3) across pretrained backbones under non-IID settings reflecting institutional data heterogeneity. Our results show that federated fine-tuning performs close to centralized training and outperforms isolated single-institution learning. From a Green AI perspective, QLoRA and IA3 improve efficiency with limited accuracy degradation, supporting federated PEFT as a viable approach for adapting LLMs where data cannot be shared.

CR Oct 2, 2017

Revealing the Unseen: How to Expose Cloud Usage While Protecting User Privacy

Ata Turk, Mayank Varia, Georgios Kellaris

Cloud users have little visibility into the performance characteristics and utilization of the physical machines underpinning the virtualized cloud resources they use. This uncertainty forces users and researchers to reverse engineer the inner workings of cloud systems in order to understand and optimize the conditions under which their applications operate. At Massachusetts Open Cloud (MOC), as a public cloud operator, we'd like to expose the utilization of our physical infrastructure to stop this wasteful effort. Mindful that such exposure can be used maliciously to gain insight into other users' workloads, in this position paper we argue for an approach that balances openness of the cloud overall with privacy for each tenant inside it. We believe that this approach can be instantiated via a novel combination of several security and privacy technologies. We discuss the potential benefits, implications of transparency for cloud systems and users, and technical challenges/possibilities.

CR Jun 5, 2017

$\mathcal{E}\text{psolute}$: Efficiently Querying Databases While Providing Differential Privacy

Dmytro Bogatov, Georgios Kellaris, George Kollios et al.

As organizations struggle with processing vast amounts of information, outsourcing sensitive data to third parties becomes a necessity. To protect the data, various cryptographic techniques are used in outsourced database systems to ensure data privacy, while allowing efficient querying. A rich collection of attacks on such systems has emerged. Even with strong cryptography, just communication volume or access pattern is enough for an adversary to succeed. In this work we present a model for a differentially private outsourced database system and a concrete construction, $\mathcal{E}\text{psolute}$, that provably conceals the aforementioned leakages, while remaining efficient and scalable. In our solution, differential privacy is preserved at the record level even against an untrusted server that controls data and queries. $\mathcal{E}\text{psolute}$ combines Oblivious RAM and differentially private sanitizers to create a generic and efficient construction. We go further and present a set of improvements that bring the solution to the efficiency and practicality necessary for real-world adoption. We describe how to parallelize the operations, minimize the amount of noise, and reduce the number of network requests, while preserving the privacy guarantees. We have run an extensive set of experiments, with dozens of servers processing up to 10 million records, and compiled a detailed result analysis demonstrating the efficiency and scalability of our solution. While providing strong security and privacy guarantees, we are less than an order of magnitude slower than range query execution of a non-secure plain-text optimized RDBMS like MySQL and PostgreSQL.