Jean-Pierre Hubaux

CR
15 papers
1,031 citations
Novelty 49%
AI Score 27

15 Papers

CR Sep 6, 2022
Orchestrating Collaborative Cybersecurity: A Secure Framework for Distributed Privacy-Preserving Threat Intelligence Sharing

Juan R. Troncoso-Pastoriza, Alain Mermoud, Romain Bouyé et al.

Cyber Threat Intelligence (CTI) sharing is an important activity to reduce information asymmetries between attackers and defenders. However, this activity presents challenges due to the tension between data sharing and confidentiality, which results in information retention and often leads to a free-rider problem. Therefore, the information that is shared represents only the tip of the iceberg. Current literature assumes access to centralized databases containing all the information, but this is not always feasible, due to the aforementioned tension. This results in unbalanced or incomplete datasets, requiring the use of techniques to expand them; we show how these techniques lead to biased results and misleading performance expectations. We propose a novel framework for extracting CTI from distributed data on incidents, vulnerabilities and indicators of compromise, and demonstrate its use in several practical scenarios, in conjunction with the Malware Information Sharing Platform (MISP). Policy implications for CTI sharing are presented and discussed. The proposed system relies on an efficient combination of privacy enhancing technologies and federated processing. This lets organizations stay in control of their CTI and minimize the risks of exposure or leakage, while enabling the benefits of sharing, more accurate and representative results, and more effective predictive and preventive defenses.

CR May 24, 2021
Every Byte Matters: Traffic Analysis of Bluetooth Wearable Devices

Ludovic Barman, Alexandre Dumur, Apostolos Pyrgelis et al.

Wearable devices such as smartwatches, fitness trackers, and blood-pressure monitors process, store, and communicate sensitive and personal information related to the health, lifestyle, habits and interests of the wearer. This data is exchanged with a companion app running on a smartphone over a Bluetooth connection. In this work, we investigate what can be inferred from the metadata (such as the packet timings and sizes) of encrypted Bluetooth communications between a wearable device and its connected smartphone. We show that a passive eavesdropper can use traffic-analysis attacks to accurately recognize (a) communicating devices, even without having access to the MAC address, (b) human actions (e.g., monitoring heart rate, exercising) performed on wearable devices ranging from fitness trackers to smartwatches, (c) the mere opening of specific applications on a Wear OS smartwatch (e.g., the opening of a medical app, which can immediately reveal a condition of the wearer), (d) fine-grained actions (e.g., recording an insulin injection) within a specific application that helps diabetic users to monitor their condition, and (e) the profile and habits of the wearer by continuously monitoring her traffic over an extended period. We run traffic-analysis attacks by collecting a dataset of Bluetooth traces of multiple wearable devices, by designing features based on packet sizes and timings, and by using machine learning to map the encrypted traffic to actions performed by the wearer. Then, we explore standard defense strategies; we show that these defenses do not provide sufficient protection against our attacks and introduce significant costs. Our research highlights the need to rethink how applications exchange sensitive information over Bluetooth, to minimize unnecessary data exchanges, and to design new defenses against traffic-analysis tailored to the wearable setting.
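
As a rough illustration of what "features based on packet sizes and timings" could look like, the sketch below summarizes one encrypted trace using only its metadata. The feature names, the trace format (a list of timestamp and size pairs), and the sample values are all assumptions made for this example; the abstract does not list the features the authors actually used.

```python
from statistics import mean, pstdev


def trace_features(trace):
    """Summarize a trace given as [(timestamp_seconds, packet_size_bytes), ...].

    The trace must be non-empty and sorted by time. Only metadata is used;
    payloads are assumed to be encrypted and are never inspected.
    """
    sizes = [size for _, size in trace]
    times = [t for t, _ in trace]
    # Inter-arrival gaps; fall back to a single zero gap for one-packet traces.
    gaps = [b - a for a, b in zip(times, times[1:])] or [0.0]
    return {
        "n_packets": len(trace),
        "total_bytes": sum(sizes),
        "mean_size": mean(sizes),
        "std_size": pstdev(sizes),
        "max_size": max(sizes),
        "duration": times[-1] - times[0],
        "mean_gap": mean(gaps),
        "max_gap": max(gaps),
    }


# Hypothetical trace: two small packets, two large ones, then a small one.
trace = [(0.00, 27), (0.05, 27), (0.10, 251), (0.40, 251), (0.45, 31)]
features = trace_features(trace)
print(features)
```

A feature dictionary like this would then be fed to any off-the-shelf classifier, with one labeled trace per recorded action.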

CR Mar 16, 2021
SoK: Privacy-Preserving Collaborative Tree-based Model Learning

Sylvain Chatel, Apostolos Pyrgelis, Juan Ramon Troncoso-Pastoriza et al.

Tree-based models are among the most efficient machine learning techniques for data mining nowadays due to their accuracy, interpretability, and simplicity. The recent orthogonal needs for more data and privacy protection call for collaborative privacy-preserving solutions. In this work, we survey the literature on distributed and privacy-preserving training of tree-based models and we systematize its knowledge based on four axes: the learning algorithm, the collaborative model, the protection mechanism, and the threat model. We use this to identify the strengths and limitations of these works and provide for the first time a framework analyzing the information leakage occurring in distributed tree-based model learning.

CR Jan 21, 2021
Privacy-Preserving and Efficient Verification of the Outcome in Genome-Wide Association Studies

Anisa Halimi, Leonard Dervishi, Erman Ayday et al.

Providing provenance in scientific workflows is essential for reproducibility and auditability purposes. Workflow systems model and record provenance describing the steps performed to obtain the final results of a computation. In this work, we propose a framework that verifies the correctness of the statistical test results that are conducted by a researcher while protecting individuals' privacy in the researcher's dataset. The researcher publishes the workflow of the conducted study, its output, and associated metadata. They keep the research dataset private while providing, as part of the metadata, a partial noisy dataset (that achieves local differential privacy). To check the correctness of the workflow output, a verifier makes use of the workflow, its metadata, and results of another statistical study (using publicly available datasets) to distinguish between correct statistics and incorrect ones. We apply the proposed framework to genome-wide association studies (GWAS), in which the goal is to identify point mutations (variants) that are highly associated with a given phenotype. For evaluation, we use real genomic data and show that the correctness of the workflow output can be verified with high accuracy even when the aggregate statistics of a small number of variants are provided. We also quantify the privacy leakage due to the provided workflow and its associated metadata in the GWAS use case and show that the additional privacy risk due to the provided metadata does not increase the existing privacy risk due to sharing of the research results. Thus, our results show that the workflow output (i.e., research results) can be verified with high confidence in a privacy-preserving way. We believe that this work will be a valuable step towards providing provenance in a privacy-preserving way while providing guarantees to the users about the correctness of the results.

CR Oct 27, 2020
Revolutionizing Medical Data Sharing Using Advanced Privacy Enhancing Technologies: Technical, Legal and Ethical Synthesis

James Scheibner, Jean Louis Raisaro, Juan Ramón Troncoso-Pastoriza et al.

Multisite medical data sharing is critical in modern clinical practice and medical research. The challenge is to conduct data sharing that preserves individual privacy and data usability. The shortcomings of traditional privacy-enhancing technologies mean that institutions rely on bespoke data sharing contracts. These contracts increase the inefficiency of data sharing and may disincentivize important clinical treatment and medical research. This paper provides a synthesis between two novel advanced privacy enhancing technologies (PETs): Homomorphic Encryption and Secure Multiparty Computation (defined together as Multiparty Homomorphic Encryption or MHE). These PETs provide a mathematical guarantee of privacy, with MHE providing a performance advantage over separately using HE or SMC. We argue MHE fulfills legal requirements for medical data sharing under the General Data Protection Regulation (GDPR) which has set a global benchmark for data protection. Specifically, the data processed and shared using MHE can be considered anonymized data. We explain how MHE can reduce the reliance on customized contractual measures between institutions. The proposed approach can accelerate the pace of medical research whilst offering additional incentives for healthcare and research institutes to employ common data interoperability standards.

CR Sep 1, 2020
POSEIDON: Privacy-Preserving Federated Neural Network Learning

Sinem Sav, Apostolos Pyrgelis, Juan R. Troncoso-Pastoriza et al.

In this paper, we address the problem of privacy-preserving training and evaluation of neural networks in an $N$-party, federated learning setting. We propose a novel system, POSEIDON, the first of its kind in the regime of privacy-preserving neural network training. It employs multiparty lattice-based cryptography to preserve the confidentiality of the training data, the model, and the evaluation data, under a passive-adversary model and collusions between up to $N-1$ parties. To efficiently execute the secure backpropagation algorithm for training neural networks, we provide a generic packing approach that enables Single Instruction, Multiple Data (SIMD) operations on encrypted data. We also introduce arbitrary linear transformations within the cryptographic bootstrapping operation, optimizing the costly cryptographic computations over the parties, and we define a constrained optimization problem for choosing the cryptographic parameters. Our experimental results show that POSEIDON achieves accuracy similar to centralized or decentralized non-private approaches and that its computation and communication overhead scales linearly with the number of parties. POSEIDON trains a 3-layer neural network on the MNIST dataset with 784 features and 60K samples distributed among 10 parties in less than 2 hours.

CR Jul 8, 2020
Privacy and Integrity Preserving Computations with CRISP

Sylvain Chatel, Apostolos Pyrgelis, Juan R. Troncoso-Pastoriza et al.

In the digital era, users share their personal data with service providers to obtain some utility, e.g., access to high-quality services. Yet, the induced information flows raise privacy and integrity concerns. Consequently, cautious users may want to protect their privacy by minimizing the amount of information they disclose to curious service providers. Service providers are interested in verifying the integrity of the users' data to improve their services and obtain useful knowledge for their business. In this work, we present a generic solution to the trade-off between privacy, integrity, and utility, by achieving authenticity verification of data that has been encrypted for offloading to service providers. Based on lattice-based homomorphic encryption and commitments, as well as zero-knowledge proofs, our construction enables a service provider to process and reuse third-party signed data in a privacy-friendly manner with integrity guarantees. We evaluate our solution on different use cases such as smart-metering, disease susceptibility, and location-based activity tracking, thus showing its versatility. Our solution achieves broad generality, quantum-resistance, and relaxes some assumptions of state-of-the-art solutions without affecting performance.

CR May 25, 2020
Decentralized Privacy-Preserving Proximity Tracing

Carmela Troncoso, Mathias Payer, Jean-Pierre Hubaux et al.

This document describes and analyzes a system for secure and privacy-preserving proximity tracing at large scale. This system, referred to as DP3T, provides a technological foundation to help slow the spread of SARS-CoV-2 by simplifying and accelerating the process of notifying people who might have been exposed to the virus so that they can take appropriate measures to break its transmission chain. The system aims to minimise privacy and security risks for individuals and communities and guarantee the highest level of data protection. The goal of our proximity tracing system is to determine who has been in close physical proximity to a COVID-19 positive person and thus exposed to the virus, without revealing the contact's identity or where the contact occurred. To achieve this goal, users run a smartphone app that continually broadcasts an ephemeral, pseudo-random ID representing the user's phone and also records the pseudo-random IDs observed from smartphones in close proximity. When a patient is diagnosed with COVID-19, she can upload pseudo-random IDs previously broadcast from her phone to a central server. Prior to the upload, all data remains exclusively on the user's phone. Other users' apps can use data from the server to locally estimate whether the device's owner was exposed to the virus through close-range physical proximity to a COVID-19 positive person who has uploaded their data. In case the app detects a high risk, it will inform the user.
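
The broadcast, upload, and local-matching flow described above can be sketched in a few lines. This is a toy model under stated assumptions, not the DP3T specification: the HMAC-based derivation, the 16-byte ID length, the 96 IDs per day, and the `Phone` class are all hypothetical choices made for illustration, and the real system estimates a risk level rather than returning a simple yes/no match.

```python
import hashlib
import hmac
import secrets

EPHEMERAL_IDS_PER_DAY = 96  # hypothetical: one ID per 15-minute slot


def ephemeral_ids(day_key: bytes, count: int = EPHEMERAL_IDS_PER_DAY) -> list[bytes]:
    """Derive short pseudo-random IDs from a secret daily key (illustrative only)."""
    return [
        hmac.new(day_key, b"ephid" + i.to_bytes(2, "big"), hashlib.sha256).digest()[:16]
        for i in range(count)
    ]


class Phone:
    def __init__(self) -> None:
        self.day_key = secrets.token_bytes(32)  # stays on the phone
        self.observed: set[bytes] = set()       # IDs heard from nearby phones

    def broadcast(self, slot: int) -> bytes:
        return ephemeral_ids(self.day_key)[slot]

    def hear(self, eph_id: bytes) -> None:
        self.observed.add(eph_id)

    def upload_after_diagnosis(self) -> list[bytes]:
        """Return the IDs this phone previously broadcast, for the server to publish."""
        return ephemeral_ids(self.day_key)

    def exposed(self, published: list[bytes]) -> bool:
        """Local check: did this phone observe any of the published IDs?"""
        return not self.observed.isdisjoint(published)


alice, bob, carol = Phone(), Phone(), Phone()
bob.hear(alice.broadcast(slot=10))       # Bob was near Alice; Carol was not
server = alice.upload_after_diagnosis()  # Alice is diagnosed and uploads her IDs
print(bob.exposed(server), carol.exposed(server))
```

Note that the matching runs entirely on Bob's and Carol's phones: the server only relays the IDs Alice chose to upload and never learns who observed them.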

CR May 19, 2020
Scalable Privacy-Preserving Distributed Learning

David Froelicher, Juan R. Troncoso-Pastoriza, Apostolos Pyrgelis et al.

In this paper, we address the problem of privacy-preserving distributed learning and the evaluation of machine-learning models by analyzing it in the widespread MapReduce abstraction that we extend with privacy constraints. We design SPINDLE (Scalable Privacy-preservINg Distributed LEarning), the first distributed and privacy-preserving system that covers the complete ML workflow by enabling the execution of a cooperative gradient-descent and the evaluation of the obtained model and by preserving data and model confidentiality in a passive-adversary model with up to N-1 colluding parties. SPINDLE uses multiparty homomorphic encryption to execute parallel high-depth computations on encrypted data without significant overhead. We instantiate SPINDLE for the training and evaluation of generalized linear models on distributed datasets and show that it is able to accurately (on par with non-secure centrally-trained models) and efficiently (due to a multi-level parallelization of the computations) train models that require a high number of iterations on large input data with thousands of features, distributed among hundreds of data providers. For instance, it trains a logistic-regression model on a dataset of one million samples with 32 features distributed among 160 data providers in less than three minutes.
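
To make the MapReduce view of cooperative gradient descent concrete, here is a minimal plaintext sketch for logistic regression. It is not SPINDLE: no encryption is involved, and the function names, learning rate, round count, and toy data are assumptions for illustration. In the system described above, the values exchanged in the reduce step would be protected with multiparty homomorphic encryption.

```python
import math


def local_gradient(weights, rows):
    """Map step: one provider computes the gradient on its own rows only."""
    grad = [0.0] * len(weights)
    for features, label in rows:
        z = sum(w * x for w, x in zip(weights, features))
        pred = 1.0 / (1.0 + math.exp(-z))
        for j, x in enumerate(features):
            grad[j] += (pred - label) * x
    return grad, len(rows)


def combine(partials):
    """Reduce step: sum the local gradients and average over all samples."""
    total = sum(n for _, n in partials)
    dim = len(partials[0][0])
    return [sum(g[j] for g, _ in partials) / total for j in range(dim)]


def train(providers, dim, rounds=200, lr=0.5):
    weights = [0.0] * dim
    for _ in range(rounds):
        partials = [local_gradient(weights, rows) for rows in providers]
        grad = combine(partials)
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


# Toy data split across three providers: features are (bias, x), label is 1 when x > 0.
providers = [
    [((1.0, -2.0), 0), ((1.0, 1.5), 1)],
    [((1.0, -1.0), 0), ((1.0, 2.0), 1)],
    [((1.0, -0.5), 0), ((1.0, 0.5), 1)],
]
w = train(providers, dim=2)
print(w)
```

Only gradients and sample counts cross provider boundaries here; the raw rows never leave the provider that holds them.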

CR Feb 11, 2019
Drynx: Decentralized, Secure, Verifiable System for Statistical Queries and Machine Learning on Distributed Datasets

David Froelicher, Juan R. Troncoso-Pastoriza, Joao Sa Sousa et al.

Data sharing has become of primary importance in many domains such as big-data analytics, economics and medical research, but remains difficult to achieve when the data are sensitive. In fact, sharing personal information requires individuals' unconditional consent or is often simply forbidden for privacy and security reasons. In this paper, we propose Drynx, a decentralized system for privacy-conscious statistical analysis on distributed datasets. Drynx relies on a set of computing nodes to enable the computation of statistics such as standard deviation or extrema, and the training and evaluation of machine-learning models on sensitive and distributed data. To ensure data confidentiality and the privacy of the data providers, Drynx combines interactive protocols, homomorphic encryption, zero-knowledge proofs of correctness, and differential privacy. It enables an efficient and decentralized verification of the input data and of all the system's computations, thus providing auditability in a strong adversarial model in which no entity has to be individually trusted. Drynx is highly modular, dynamic and parallelizable. Our evaluation shows that it enables the training of a logistic regression model on a dataset (12 features and 600,000 records) distributed among 12 data providers in less than 2 seconds. The computations are distributed among 6 computing nodes, and Drynx enables the verification of the query execution's correctness in less than 22 seconds.

CR Jun 8, 2018
Reducing Metadata Leakage from Encrypted Files and Communication with PURBs

Kirill Nikitin, Ludovic Barman, Wouter Lueks et al.

Most encrypted data formats leak metadata via their plaintext headers, such as format version, encryption schemes used, number of recipients who can decrypt the data, and even the recipients' identities. This leakage can pose security and privacy risks to users, e.g., by revealing the full membership of a group of collaborators from a single encrypted e-mail, or by enabling an eavesdropper to fingerprint the precise encryption software version and configuration the sender used. We propose that future encrypted data formats improve security and privacy hygiene by producing Padded Uniform Random Blobs or PURBs: ciphertexts indistinguishable from random bit strings to anyone without a decryption key. A PURB's content leaks nothing at all, even the application that created it, and is padded such that even its length leaks as little as possible. Encoding and decoding ciphertexts with no cleartext markers presents efficiency challenges, however. We present cryptographically agile encodings enabling legitimate recipients to decrypt a PURB efficiently, even when encrypted for any number of recipients' public keys and/or passwords, and when these public keys are from different cryptographic suites. PURBs employ Padmé, a novel padding scheme that limits information leakage via ciphertexts of maximum length $M$ to a practical optimum of $O(\log \log M)$ bits, comparable to padding to a power of two, but with lower overhead of at most $12\%$ and decreasing with larger payloads.
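
The abstract does not spell out the padding rule, so the following is a sketch of one formulation consistent with its description: the length is rounded up so that a number of low-order bits, which grows with the length, are zero. The exact bit arithmetic and the function name `padme` are assumptions for this example and should be checked against the paper before any real use.

```python
def padme(length: int) -> int:
    """Round `length` up so that its low-order bits are zero (Padmé-style sketch)."""
    if length < 2:
        return length
    e = length.bit_length() - 1  # floor(log2(length))
    s = e.bit_length()           # floor(log2(e)) + 1: bits needed to write e
    low_bits = e - s             # number of trailing bits forced to zero
    mask = (1 << low_bits) - 1
    return (length + mask) & ~mask


for n in (9, 129, 1000, 1_000_000):
    padded = padme(n)
    print(n, padded, f"{(padded - n) / n:.2%}")
```

Because only about log2(log2(length)) leading bits of the padded length can vary beyond its magnitude, many different plaintext sizes collapse onto the same padded size, while the relative overhead shrinks as payloads grow.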

CR Oct 27, 2017
PriFi: Low-Latency Anonymity for Organizational Networks

Ludovic Barman, Italo Dacosta, Mahdi Zamani et al.

Organizational networks are vulnerable to traffic-analysis attacks that enable adversaries to infer sensitive information from the network traffic, even if encryption is used. Typical anonymous communication networks are tailored to the Internet and are poorly suited for organizational networks. We present PriFi, an anonymous communication protocol for LANs, which protects users against eavesdroppers and provides high-performance traffic-analysis resistance. PriFi builds on Dining Cryptographers networks but reduces the high communication latency of prior work via a new client/relay/server architecture, in which a client's packets remain on their usual network path without additional hops, and in which a set of remote servers assist the anonymization process without adding latency. PriFi also solves the challenge of equivocation attacks, which are not addressed by related works, by encrypting the traffic based on the communication history. Our evaluation shows that PriFi introduces a small latency overhead (~100 ms for 100 clients) and is compatible with delay-sensitive applications such as VoIP.
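
For readers unfamiliar with Dining Cryptographers networks, the sketch below shows one round of the classic pairwise-pad construction that PriFi builds on. It does not model PriFi's client/relay/server architecture or its equivocation protection; the slot size, function names, and the choice of pairwise pads between clients are assumptions for illustration.

```python
import secrets
from itertools import combinations

SLOT_BYTES = 16  # hypothetical fixed slot size


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def dc_net_round(n_clients: int, sender: int, message: bytes) -> bytes:
    """Run one DC-net round and return the XOR of all clients' outputs."""
    assert len(message) == SLOT_BYTES
    # Every pair of clients shares a fresh random pad for this round.
    pads = {
        pair: secrets.token_bytes(SLOT_BYTES)
        for pair in combinations(range(n_clients), 2)
    }

    outputs = []
    for c in range(n_clients):
        # Only the sender mixes in a message; everyone else contributes zeros.
        out = message if c == sender else bytes(SLOT_BYTES)
        for pair, pad in pads.items():
            if c in pair:
                out = xor(out, pad)
        outputs.append(out)

    # Each pad appears in exactly two outputs, so all pads cancel out.
    combined = bytes(SLOT_BYTES)
    for out in outputs:
        combined = xor(combined, out)
    return combined


msg = b"hello, dc-net!!!"
print(dc_net_round(n_clients=5, sender=2, message=msg) == msg)
```

Each individual output looks uniformly random, so an observer who sees all of them recovers the message but cannot tell which client sent it.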

CR Sep 5, 2014
Prolonging the Hide-and-Seek Game: Optimal Trajectory Privacy for Location-Based Services

George Theodorakopoulos, Reza Shokri, Carmela Troncoso et al.

Human mobility is highly predictable. Individuals tend to only visit a few locations with high frequency, and to move among them in a certain sequence reflecting their habits and daily routine. This predictability has to be taken into account in the design of location privacy preserving mechanisms (LPPMs) in order to effectively protect users when they continuously expose their position to location-based services (LBSs). In this paper, we describe a method for creating LPPMs that are customized for a user's mobility profile taking into account privacy and quality of service requirements. By construction, our LPPMs take into account the sequential correlation across the user's exposed locations, providing the maximum possible trajectory privacy, i.e., privacy for the user's present location, as well as past and expected future locations. Moreover, our LPPMs are optimal against a strategic adversary, i.e., an attacker that implements the strongest inference attack knowing both the LPPM operation and the user's mobility profile. The optimality of the LPPMs in the context of trajectory privacy is a novel contribution, and it is achieved by formulating the LPPM design problem as a Bayesian Stackelberg game between the user and the adversary. An additional benefit of our formal approach is that the design parameters of the LPPM are chosen by the optimization algorithm.

CR May 8, 2014
Privacy in the Genomic Era

Muhammad Naveed, Erman Ayday, Ellen W. Clayton et al.

Genome sequencing technology has advanced at a rapid pace and it is now possible to generate highly detailed genotypes inexpensively. The collection and analysis of such data has the potential to support various applications, including personalized medical services. While the benefits of the genomics revolution are trumpeted by the biomedical community, the increased availability of such data has major implications for personal privacy, notably because the genome has certain essential features, which include (but are not limited to) (i) an association with traits and certain diseases, (ii) identification capability (e.g., forensics), and (iii) revelation of family relationships. Moreover, direct-to-consumer DNA testing increases the likelihood that genome data will be made available in less regulated environments, such as the Internet and for-profit companies. The problem of genome data privacy thus resides at the crossroads of computer science, medicine, and public policy. While computer scientists have addressed data privacy for various data types, there has been less attention dedicated to genomic data. Thus, the goal of this paper is to provide a systematization of knowledge for the computer science community. In doing so, we address some of the (sometimes erroneous) beliefs of this field and we report on a survey we conducted about genome data privacy with biomedical specialists. Then, after characterizing the genome privacy problem, we review the state of the art regarding privacy attacks on genomic data and strategies for mitigating such attacks, as well as contextualizing these attacks from the perspective of medicine and public policy. This paper concludes with an enumeration of the challenges for genome data privacy and presents a framework to systematize the analysis of threats and the design of countermeasures as the field moves forward.

CR Jun 5, 2013
The Chills and Thrills of Whole Genome Sequencing

Erman Ayday, Emiliano De Cristofaro, Jean-Pierre Hubaux et al.

In recent years, Whole Genome Sequencing (WGS) evolved from a futuristic-sounding research project to an increasingly affordable technology for determining complete genome sequences of complex organisms, including humans. This prompts a wide range of revolutionary applications, as WGS promises to improve modern healthcare and provide a better understanding of the human genome -- in particular, its relation to diseases and response to treatments. However, this progress raises worrisome privacy and ethical issues, since, besides uniquely identifying its owner, the genome contains a treasure trove of highly personal and sensitive information. In this article, after summarizing recent advances in genomics, we discuss some important privacy issues associated with human genomic information and identify a number of particularly relevant research challenges.