Florian Kerschbaum

CR
h-index7
47papers
1,889citations
Novelty53%
AI Score59

47 Papers

AIAug 28, 2023
Identifying and Mitigating the Security Risks of Generative AI

Clark Barrett, Brad Boyd, Elie Burzstein et al. · berkeley

Every major technical invention resurfaces the dual-use dilemma -- the new technology has the potential to be used for good as well as for harm. Generative AI (GenAI) techniques, such as large language models (LLMs) and diffusion models, have shown remarkable capabilities (e.g., in-context learning, code-completion, and text-to-image generation and editing). However, GenAI can be used just as well by attackers to generate new attacks and increase the velocity and efficacy of existing attacks. This paper reports the findings of a workshop held at Google (co-organized by Stanford University and the University of Wisconsin-Madison) on the dual-use dilemma posed by GenAI. This paper is not meant to be comprehensive, but is rather an attempt to synthesize some of the interesting findings from the workshop. We discuss short-term and long-term goals for the community on this topic. We hope this paper provides both a launching point for a discussion on this important topic as well as interesting problems that the research community can work to address.

LGApr 14, 2023Code
PTW: Pivotal Tuning Watermarking for Pre-Trained Image Generators

Nils Lukas, Florian Kerschbaum

Deepfakes refer to content synthesized using deep generators, which, when misused, have the potential to erode trust in digital media. Synthesizing high-quality deepfakes requires access to large and complex generators only a few entities can train and provide. The threat is malicious users that exploit access to the provided model and generate harmful deepfakes without risking detection. Watermarking makes deepfakes detectable by embedding an identifiable code into the generator that is later extractable from its generated images. We propose Pivotal Tuning Watermarking (PTW), a method for watermarking pre-trained generators (i) three orders of magnitude faster than watermarking from scratch and (ii) without the need for any training data. We improve existing watermarking methods and scale to generators $4 \times$ larger than related work. PTW can embed longer codes than existing methods while better preserving the generator's image quality. We propose rigorous, game-based definitions for robustness and undetectability, and our study reveals that watermarking is not robust against an adaptive white-box attacker who controls the generator's parameters. We propose an adaptive attack that can successfully remove any watermarking with access to only 200 non-watermarked images. Our work challenges the trustworthiness of watermarking for deepfake detection when the parameters of a generator are available. The source code to reproduce our experiments is available at https://github.com/nilslukas/gan-watermark.

CRAug 21, 2023Code
Backdooring Textual Inversion for Concept Censorship

Yutong Wu, Jie Zhang, Florian Kerschbaum et al.

Recent years have witnessed success in AIGC (AI Generated Content). People can make use of a pre-trained diffusion model to generate images of high quality or freely modify existing pictures with only prompts in nature language. More excitingly, the emerging personalization techniques make it feasible to create specific-desired images with only a few images as references. However, this induces severe threats if such advanced techniques are misused by malicious users, such as spreading fake news or defaming individual reputations. Thus, it is necessary to regulate personalization models (i.e., concept censorship) for their development and advancement. In this paper, we focus on the personalization technique dubbed Textual Inversion (TI), which is becoming prevailing for its lightweight nature and excellent performance. TI crafts the word embedding that contains detailed information about a specific object. Users can easily download the word embedding from public websites like Civitai and add it to their own stable diffusion model without fine-tuning for personalization. To achieve the concept censorship of a TI model, we propose leveraging the backdoor technique for good by injecting backdoors into the Textual Inversion embeddings. Briefly, we select some sensitive words as triggers during the training of TI, which will be censored for normal use. In the subsequent generation stage, if the triggers are combined with personalized embeddings as final prompts, the model will output a pre-defined target image rather than images including the desired malicious concept. To demonstrate the effectiveness of our approach, we conduct extensive experiments on Stable Diffusion, a prevailing open-sourced text-to-image model. Our code, data, and results are available at https://concept-censorship.github.io.

CRMay 26
Cloak: Heuristic ORAM Optimization Through Fixed Temporal Distribution

Onur Eren Arpaci, Florian Kerschbaum, Sujaya Maiyya

Encrypted cloud storage can hide data contents but still leak sensitive information through access patterns. ORAM addresses this by hiding access patterns, but existing ORAM systems are too inefficient to deploy in practice. We present Cloak, an oblivious storage system that dramatically improves performance by leveraging a simple, widely observed property of real workloads: temporal locality, where recently accessed items are more likely to be accessed again soon. Instead of trying to make server accesses look perfectly uniform, Cloak makes server traffic follow a fixed, "recentness-biased" pattern and then uses real queries to fill as much of that traffic as possible. When the workload exhibits temporal locality, Cloak achieves overheads as low as $1.1\times$ over a non-oblivious and unencrypted baseline. Importantly, this heuristic affects only performance, not security. We evaluate Cloak on Netflix click-stream and Ethereum transaction traces, achieving 165,000 and 157,000 operations per second, respectively, on a single machine.

CRMay 2, 2022
The Limits of Word Level Differential Privacy

Justus Mattern, Benjamin Weggenmann, Florian Kerschbaum

As the issues of privacy and trust are receiving increasing attention within the research community, various attempts have been made to anonymize textual data. A significant subset of these approaches incorporate differentially private mechanisms to perturb word embeddings, thus replacing individual words in a sentence. While these methods represent very important contributions, have various advantages over other techniques and do show anonymization capabilities, they have several shortcomings. In this paper, we investigate these weaknesses and demonstrate significant mathematical constraints diminishing the theoretical privacy guarantee as well as major practical shortcomings with regard to the protection against deanonymization attacks, the preservation of content of the original sentences as well as the quality of the language output. Finally, we propose a new method for text anonymization based on transformer based language models fine-tuned for paraphrasing that circumvents most of the identified weaknesses and also offers a formal privacy guarantee. We evaluate the performance of our method via thorough experimentation and demonstrate superior performance over the discussed mechanisms.

LGNov 30, 2023Code
Universal Backdoor Attacks

Benjamin Schneider, Nils Lukas, Florian Kerschbaum

Web-scraped datasets are vulnerable to data poisoning, which can be used for backdooring deep image classifiers during training. Since training on large datasets is expensive, a model is trained once and re-used many times. Unlike adversarial examples, backdoor attacks often target specific classes rather than any class learned by the model. One might expect that targeting many classes through a naive composition of attacks vastly increases the number of poison samples. We show this is not necessarily true and more efficient, universal data poisoning attacks exist that allow controlling misclassifications from any source class into any target class with a small increase in poison samples. Our idea is to generate triggers with salient characteristics that the model can learn. The triggers we craft exploit a phenomenon we call inter-class poison transferability, where learning a trigger from one class makes the model more vulnerable to learning triggers for other classes. We demonstrate the effectiveness and robustness of our universal backdoor attacks by controlling models with up to 6,000 classes while poisoning only 0.15% of the training dataset. Our source code is available at https://github.com/Ben-Schneider-code/Universal-Backdoor-Attacks.

CRJun 14, 2023
Fast and Private Inference of Deep Neural Networks by Co-designing Activation Functions

Abdulrahman Diaa, Lucas Fenaux, Thomas Humphries et al.

Machine Learning as a Service (MLaaS) is an increasingly popular design where a company with abundant computing resources trains a deep neural network and offers query access for tasks like image classification. The challenge with this design is that MLaaS requires the client to reveal their potentially sensitive queries to the company hosting the model. Multi-party computation (MPC) protects the client's data by allowing encrypted inferences. However, current approaches suffer from prohibitively large inference times. The inference time bottleneck in MPC is the evaluation of non-linear layers such as ReLU activation functions. Motivated by the success of previous work co-designing machine learning and MPC, we develop an activation function co-design. We replace all ReLUs with a polynomial approximation and evaluate them with single-round MPC protocols, which give state-of-the-art inference times in wide-area networks. Furthermore, to address the accuracy issues previously encountered with polynomial activations, we propose a novel training algorithm that gives accuracy competitive with plaintext models. Our evaluation shows between $3$ and $110\times$ speedups in inference time on large models with up to $23$ million parameters while maintaining competitive inference accuracy.

CVNov 19, 2022
Towards Robust Dataset Learning

Yihan Wu, Xinda Li, Florian Kerschbaum et al.

Adversarial training has been actively studied in recent computer vision research to improve the robustness of models. However, due to the huge computational cost of generating adversarial samples, adversarial training methods are often slow. In this paper, we study the problem of learning a robust dataset such that any classifier naturally trained on the dataset is adversarially robust. Such a dataset benefits the downstream tasks as natural training is much faster than adversarial training, and demonstrates that the desired property of robustness is transferable between models and data. In this work, we propose a principled, tri-level optimization to formulate the robust dataset learning problem. We show that, under an abstraction model that characterizes robust vs. non-robust features, the proposed method provably learns a robust dataset. Extensive experiments on MNIST, CIFAR10, and TinyImageNet demostrate the effectiveness of our algorithm with different network initializations and architectures.

CRApr 16, 2022
Assessing Differentially Private Variational Autoencoders under Membership Inference

Daniel Bernau, Jonas Robl, Florian Kerschbaum

We present an approach to quantify and compare the privacy-accuracy trade-off for differentially private Variational Autoencoders. Our work complements previous work in two aspects. First, we evaluate the the strong reconstruction MI attack against Variational Autoencoders under differential privacy. Second, we address the data scientist's challenge of setting privacy parameter epsilon, which steers the differential privacy strength and thus also the privacy-accuracy trade-off. In our experimental study we consider image and time series data, and three local and central differential privacy mechanisms. We find that the privacy-accuracy trade-offs strongly depend on the dataset and model architecture. We do rarely observe favorable privacy-accuracy trade-off for Variational Autoencoders, and identify a case where LDP outperforms CDP.

CRSep 29, 2023
Leveraging Optimization for Adaptive Attacks on Image Watermarks

Nils Lukas, Abdulrahman Diaa, Lucas Fenaux et al.

Untrustworthy users can misuse image generators to synthesize high-quality deepfakes and engage in unethical activities. Watermarking deters misuse by marking generated content with a hidden message, enabling its detection using a secret watermarking key. A core security property of watermarking is robustness, which states that an attacker can only evade detection by substantially degrading image quality. Assessing robustness requires designing an adaptive attack for the specific watermarking algorithm. When evaluating watermarking algorithms and their (adaptive) attacks, it is challenging to determine whether an adaptive attack is optimal, i.e., the best possible attack. We solve this problem by defining an objective function and then approach adaptive attacks as an optimization problem. The core idea of our adaptive attacks is to replicate secret watermarking keys locally by creating surrogate keys that are differentiable and can be used to optimize the attack's parameters. We demonstrate for Stable Diffusion models that such an attacker can break all five surveyed watermarking methods at no visible degradation in image quality. Optimizing our attacks is efficient and requires less than 1 GPU hour to reduce the detection accuracy to 6.3% or less. Our findings emphasize the need for more rigorous robustness testing against adaptive, learnable attackers.

CRMar 10
ZipPIR: High-throughput Single-server PIR without Client-side Storage

Rasoul Akhavan Mahdavi, Abdulrahman Diaa, Florian Kerschbaum

Private Information Retrieval (PIR) allows a client to privately access a database without revealing which element is accessed. Initial PIR protocols based on Ring Learning with Errors (RLWE) demonstrated the practicality of PIR, but achieve limited throughput. Alternatively, high-throughput protocols leverage an offline phase that requires substantial client-side storage (e.g., hints in SimplePIR) or involve prohibitive communication costs during the offline phase (e.g., Piano). These limitations conflict with the practical constraints of resource-limited clients and are further exacerbated by dynamic databases, where updates necessitate costly regeneration and retransmission of hints. To address these challenges, we propose ZipPIR, a high-throughput PIR protocol that compresses LWE ciphertexts into significantly smaller Paillier ciphertexts. ZipPIR leverages the offline phase to obtain this size reduction without incurring the associated computational cost in the online phase. Moreover, under computational assumptions, ZipPIR features an almost silent offline phase, requiring no communication beyond an initial public key, enabling the server to independently generate and update hints during idle times without client interaction. ZipPIR achieves over 2 GB/s of throughput - comparable to state-of-the-art protocols such as SimplePIR - without the need for a large client-stored hint. For PIR over a 1 GB database, ZipPIR has up to 10x higher throughput than existing protocols with no client-side storage, while requiring less than 200 KB of server-side storage per client, significantly enhancing scalability for practical deployments. While prior PIR protocols using Paillier are very inefficient, ZipPIR is the first PIR protocol using Paillier that achieves throughput that is competitive with state-of-the-art PIR protocols.

LGNov 14, 2025
On the Trade-Off Between Transparency and Security in Adversarial Machine Learning

Lucas Fenaux, Christopher Srinivasa, Florian Kerschbaum

Transparency and security are both central to Responsible AI, but they may conflict in adversarial settings. We investigate the strategic effect of transparency for agents through the lens of transferable adversarial example attacks. In transferable adversarial example attacks, attackers maliciously perturb their inputs using surrogate models to fool a defender's target model. These models can be defended or undefended, with both players having to decide which to use. Using a large-scale empirical evaluation of nine attacks across 181 models, we find that attackers are more successful when they match the defender's decision; hence, obscurity could be beneficial to the defender. With game theory, we analyze this trade-off between transparency and security by modeling this problem as both a Nash game and a Stackelberg game, and comparing the expected outcomes. Our analysis confirms that only knowing whether a defender's model is defended or not can sometimes be enough to damage its security. This result serves as an indicator of the general trade-off between transparency and security, suggesting that transparency in AI systems can be at odds with security. Beyond adversarial machine learning, our work illustrates how game-theoretic reasoning can uncover conflicts between transparency and security.

CVMar 1, 2025Code
ABC: Achieving Better Control of Multimodal Embeddings using VLMs

Benjamin Schneider, Florian Kerschbaum, Wenhu Chen

Visual embedding models excel at zero-shot tasks like visual retrieval and classification. However, these models cannot be used for tasks that contain ambiguity or require user instruction. These tasks necessitate an embedding model which outputs can use a natural language instruction to control the representation of a visual embedding. Existing CLIP-based approaches embed images and text independently, and fuse the result. We find that this results in weak interactions between modalities, and poor user control over the representation. We introduce ABC, an open-source multimodal embedding model that uses a vision-language model backbone to deeply integrate image features with natural language instructions. ABC achieves best-for-size performance on MSCOCO image-to-text retrieval and is the top performing model on classification and VQA tasks in the Massive Multimodal Embedding Benchmark. With a strongly unified vision-language representation, ABC can use natural language to solve subtle and potentially ambiguous visual retrieval problems. To evaluate this capability, we design CtrlBench, a benchmark that requires interleaving textual instructions with image content for correct retrieval. ABC advances the state of visual embeddings, outputting high-quality visual representations with natural language control. Our model and datasets are available at our project page: https://tiger-ai-lab.github.io/ABC/

LGMay 8
Private Vertical Federated Inference for Time-Series

Lucas Fenaux, Larris Xie, Aditya Bang et al.

Institutions may benefit from collaborative inference on time-series data. In settings where privacy is necessary, multi-party computation (MPC) is a straightforward approach to providing strong guarantees, yet it remains prohibitively expensive and scales poorly with modern transformer architectures. Vertical Federated Learning (VFL) offers efficiency but suffers from privacy leakage at the embedding level, and securing the entire VFL model head via MPC remains prohibitively slow and communication-heavy for larger models. To enable practical, secure inference at scale, we propose "Public/Private Hybrid Head-VFL" (PPHH-VFL). This hybrid architecture splits the model head into an efficient plaintext public head and a secure, lightweight MPC private head. By applying adversarial training to the public embeddings, we mitigate privacy leakage; concurrently, the small private head securely preserves the flow of sensitive information needed for high downstream utility. Empirical evaluations on models ranging up to 86 million parameters demonstrate that PPHH-VFL accelerates inference by up to six orders of magnitude compared to end-to-end MPC. Compared to a standard VFL+MPC baseline, our approach scales significantly better, achieving a speedup of up to 44.4x in WAN and a 91.2x reduction in communication costs (dropping from 1.7 GB to 19 MB per batch), while simultaneously improving downstream classification accuracy by 2.50% and regression RMSE by 40.7%.

CRFeb 13
Backdooring Bias in Large Language Models

Anudeep Das, Prach Chantasantitam, Gurjot Singh et al.

Large language models (LLMs) are increasingly deployed in settings where inducing a bias toward a certain topic can have significant consequences, and backdoor attacks can be used to produce such models. Prior work on backdoor attacks has largely focused on a black-box threat model, with an adversary targeting the model builder's LLM. However, in the bias manipulation setting, the model builder themselves could be the adversary, warranting a white-box threat model where the attacker's ability to poison, and manipulate the poisoned data is substantially increased. Furthermore, despite growing research in semantically-triggered backdoors, most studies have limited themselves to syntactically-triggered attacks. Motivated by these limitations, we conduct an analysis consisting of over 1000 evaluations using higher poisoning ratios and greater data augmentation to gain a better understanding of the potential of syntactically- and semantically-triggered backdoor attacks in a white-box setting. In addition, we study whether two representative defense paradigms, model-intrinsic and model-extrinsic backdoor removal, are able to mitigate these attacks. Our analysis reveals numerous new findings. We discover that while both syntactically- and semantically-triggered attacks can effectively induce the target behaviour, and largely preserve utility, semantically-triggered attacks are generally more effective in inducing negative biases, while both backdoor types struggle with causing positive biases. Furthermore, while both defense types are able to mitigate these backdoors, they either result in a substantial drop in utility, or require high computational overhead.

CRMay 3, 2024
FastLloyd: Federated, Accurate, Secure, and Tunable $k$-Means Clustering with Differential Privacy

Abdulrahman Diaa, Thomas Humphries, Florian Kerschbaum

We study the problem of privacy-preserving $k$-means clustering in the horizontally federated setting. Existing federated approaches using secure computation suffer from substantial overheads and do not offer output privacy. At the same time, differentially private (DP) $k$-means algorithms either assume a trusted central curator or significantly degrade utility by adding noise in the local DP model. Naively combining the secure and central DP solutions results in a protocol with impractical overhead. Instead, our work provides enhancements to both the DP and secure computation components, resulting in a design that is faster, more private, and more accurate than previous work. By utilizing the computational DP model, we design a lightweight, secure aggregation-based approach that achieves five orders of magnitude speed-up over state-of-the-art related work. Furthermore, we not only maintain the utility of the state-of-the-art in the central model of DP, but we improve the utility further by designing a new DP clustering mechanism.

LGSep 9, 2025
Hammer and Anvil: A Principled Defense Against Backdoors in Federated Learning

Lucas Fenaux, Zheng Wang, Jacob Yan et al.

Federated Learning is a distributed learning technique in which multiple clients cooperate to train a machine learning model. Distributed settings facilitate backdoor attacks by malicious clients, who can embed malicious behaviors into the model during their participation in the training process. These malicious behaviors are activated during inference by a specific trigger. No defense against backdoor attacks has stood the test of time, especially against adaptive attackers, a powerful but not fully explored category of attackers. In this work, we first devise a new adaptive adversary that surpasses existing adversaries in capabilities, yielding attacks that only require one or two malicious clients out of 20 to break existing state-of-the-art defenses. Then, we present Hammer and Anvil, a principled defense approach that combines two defenses orthogonal in their underlying principle to produce a combined defense that, given the right set of parameters, must succeed against any attack. We show that our best combined defense, Krum+, is successful against our new adaptive adversary and state-of-the-art attacks.

CLMay 22, 2025
Relative Bias: A Comparative Framework for Quantifying Bias in LLMs

Alireza Arbabi, Florian Kerschbaum

The growing deployment of large language models (LLMs) has amplified concerns regarding their inherent biases, raising critical questions about their fairness, safety, and societal impact. However, quantifying LLM bias remains a fundamental challenge, complicated by the ambiguity of what "bias" entails. This challenge grows as new models emerge rapidly and gain widespread use, while introducing potential biases that have not been systematically assessed. In this paper, we propose the Relative Bias framework, a method designed to assess how an LLM's behavior deviates from other LLMs within a specified target domain. We introduce two complementary methodologies: (1) Embedding Transformation analysis, which captures relative bias patterns through sentence representations over the embedding space, and (2) LLM-as-a-Judge, which employs a language model to evaluate outputs comparatively. Applying our framework to several case studies on bias and alignment scenarios following by statistical tests for validation, we find strong alignment between the two scoring methods, offering a systematic, scalable, and statistically grounded approach for comparative bias analysis in LLMs.

CRApr 10, 2025
Privacy-Preserving Vertical K-Means Clustering

Federico Mazzone, Trevor Brown, Florian Kerschbaum et al.

Clustering is a fundamental data processing task used for grouping records based on one or more features. In the vertically partitioned setting, data is distributed among entities, with each holding only a subset of those features. A key challenge in this scenario is that computing distances between records requires access to all distributed features, which may be privacy-sensitive and cannot be directly shared with other parties. The goal is to compute the joint clusters while preserving the privacy of each entity's dataset. Existing solutions using secret sharing or garbled circuits implement privacy-preserving variants of Lloyd's algorithm but incur high communication costs, scaling as O(nkt), where n is the number of data points, k the number of clusters, and t the number of rounds. These methods become impractical for large datasets or several parties, limiting their use to LAN settings only. On the other hand, a different line of solutions rely on differential privacy (DP) to outsource the local features of the parties to a central server. However, they often significantly degrade the utility of the clustering outcome due to excessive noise. In this work, we propose a novel solution based on homomorphic encryption and DP, reducing communication complexity to O(n+kt). In our method, parties securely outsource their features once, allowing a computing party to perform clustering operations under encryption. DP is applied only to the clusters' centroids, ensuring privacy with minimal impact on utility. Our solution clusters 100,000 two-dimensional points into five clusters using only 73MB of communication, compared to 101GB for existing works, and completes in just under 3 minutes on a 100Mbps network, whereas existing works take over 1 day. This makes our solution practical even for WAN deployments, all while maintaining accuracy comparable to plaintext k-means algorithms.

LGFeb 22, 2024
SoK: Analyzing Adversarial Examples: A Framework to Study Adversary Knowledge

Lucas Fenaux, Florian Kerschbaum

Adversarial examples are malicious inputs to machine learning models that trigger a misclassification. This type of attack has been studied for close to a decade, and we find that there is a lack of study and formalization of adversary knowledge when mounting attacks. This has yielded a complex space of attack research with hard-to-compare threat models and attacks. We focus on the image classification domain and provide a theoretical framework to study adversary knowledge inspired by work in order theory. We present an adversarial example game, inspired by cryptographic games, to standardize attacks. We survey recent attacks in the image classification domain and classify their adversary's knowledge in our framework. From this systematization, we compile results that both confirm existing beliefs about adversary knowledge, such as the potency of information about the attacked model as well as allow us to derive new conclusions on the difficulty associated with the white-box and transferable threat models, for example, that transferable attacks might not be as difficult as previously thought.

CRMay 7, 2023
Pick your Poison: Undetectability versus Robustness in Data Poisoning Attacks

Nils Lukas, Florian Kerschbaum

Deep image classification models trained on vast amounts of web-scraped data are susceptible to data poisoning - a mechanism for backdooring models. A small number of poisoned samples seen during training can severely undermine a model's integrity during inference. Existing work considers an effective defense as one that either (i) restores a model's integrity through repair or (ii) detects an attack. We argue that this approach overlooks a crucial trade-off: Attackers can increase robustness at the expense of detectability (over-poisoning) or decrease detectability at the cost of robustness (under-poisoning). In practice, attacks should remain both undetectable and robust. Detectable but robust attacks draw human attention and rigorous model evaluation or cause the model to be re-trained or discarded. In contrast, attacks that are undetectable but lack robustness can be repaired with minimal impact on model accuracy. Our research points to intrinsic flaws in current attack evaluation methods and raises the bar for all data poisoning attackers who must delicately balance this trade-off to remain robust and undetectable. To demonstrate the existence of more potent defenders, we propose defenses designed to (i) detect or (ii) repair poisoned models using a limited amount of trusted image-label pairs. Our results show that an attacker who needs to be robust and undetectable is substantially less threatening. Our defenses mitigate all tested attacks with a maximum accuracy decline of 2% using only 1% of clean data on CIFAR-10 and 2.5% on ImageNet. We demonstrate the scalability of our defenses by evaluating large vision-language models, such as CLIP. Attackers who can manipulate the model's parameters pose an elevated risk as they can achieve higher robustness at low detectability compared to data poisoning attackers.

CRFeb 15, 2022
Constant-weight PIR: Single-round Keyword PIR via Constant-weight Equality Operators

Rasoul Akhavan Mahdavi, Florian Kerschbaum

Equality operators are an essential building block in tasks over secure computation such as private information retrieval. In private information retrieval (PIR), a user queries a database such that the server does not learn which element is queried. In this work, we propose \emph{equality operators for constant-weight codewords}. A constant-weight code is a collection of codewords that share the same Hamming weight. Constant-weight equality operators have a multiplicative depth that depends only on the Hamming weight of the code, not the bit-length of the elements. In our experiments, we show how these equality operators are up to 10 times faster than existing equality operators. Furthermore, we propose PIR using the constant-weight equality operator or \emph{constant-weight PIR}, which is a PIR protocol using an approach previously deemed impractical. We show that for private retrieval of large, streaming data, constant-weight PIR has a smaller communication complexity and lower runtime compared to SEALPIR and MulPIR, respectively, which are two state-of-the-art solutions for PIR. Moreover, we show how constant-weight PIR can be extended to keyword PIR. In keyword PIR, the desired element is retrieved by a unique identifier pertaining to the sought item, e.g., the name of a file. Previous solutions to keyword PIR require one or multiple rounds of communication to reduce the problem to normal PIR. We show that constant-weight PIR is the first practical single-round solution to single-server keyword PIR.

CROct 11, 2021
Generalization Techniques Empirically Outperform Differential Privacy against Membership Inference

Jiaxiang Liu, Simon Oya, Florian Kerschbaum

Differentially private training algorithms provide protection against one of the most popular attacks in machine learning: the membership inference attack. However, these privacy algorithms incur a loss of the model's classification accuracy, therefore creating a privacy-utility trade-off. The amount of noise that differential privacy requires to provide strong theoretical protection guarantees in deep learning typically renders the models unusable, but authors have observed that even lower noise levels provide acceptable empirical protection against existing membership inference attacks. In this work, we look for alternatives to differential privacy towards empirically protecting against membership inference attacks. We study the protection that simply following good machine learning practices (not designed with privacy in mind) offers against membership inference. We evaluate the performance of state-of-the-art techniques, such as pre-training and sharpness-aware minimization, alone and with differentially private training algorithms, and find that, when using early stopping, the algorithms without differential privacy can provide both higher utility and higher privacy than their differentially private counterparts. These findings challenge the belief that differential privacy is a good defense to protect against existing membership inference attacks

CROct 8, 2021
IHOP: Improved Statistical Query Recovery against Searchable Symmetric Encryption through Quadratic Optimization

Simon Oya, Florian Kerschbaum

Effective query recovery attacks against Searchable Symmetric Encryption (SSE) schemes typically rely on auxiliary ground-truth information about the queries or dataset. Query recovery is also possible under the weaker statistical auxiliary information assumption, although statistical-based attacks achieve lower accuracy and are not considered a serious threat. In this work we present IHOP, a statistical-based query recovery attack that formulates query recovery as a quadratic optimization problem and reaches a solution by iterating over linear assignment problems. We perform an extensive evaluation with five real datasets, and show that IHOP outperforms all other statistical-based query recovery attacks under different parameter and leakage configurations, including the case where the client uses some access-pattern obfuscation defenses. In some cases, our attack achieves almost perfect query recovery accuracy. Finally, we use IHOP in a frequency-only leakage setting where the client's queries are correlated, and show that our attack can exploit query dependencies even when PANCAKE, a recent frequency-hiding defense by Grubbs et al., is applied. Our findings indicate that statistical query recovery attacks pose a severe threat to privacy-preserving SSE schemes.

CRAug 11, 2021
SoK: How Robust is Image Classification Deep Neural Network Watermarking? (Extended Version)

Nils Lukas, Edward Jiang, Xinda Li et al.

Deep Neural Network (DNN) watermarking is a method for provenance verification of DNN models. Watermarking should be robust against watermark removal attacks that derive a surrogate model that evades provenance verification. Many watermarking schemes that claim robustness have been proposed, but their robustness is only validated in isolation against a relatively small set of attacks. There is no systematic, empirical evaluation of these claims against a common, comprehensive set of removal attacks. This uncertainty about a watermarking scheme's robustness causes difficulty to trust their deployment in practice. In this paper, we evaluate whether recently proposed watermarking schemes that claim robustness are robust against a large set of removal attacks. We survey methods from the literature that (i) are known removal attacks, (ii) derive surrogate models but have not been evaluated as removal attacks, and (iii) novel removal attacks. Weight shifting and smooth retraining are novel removal attacks adapted to the DNN watermarking schemes surveyed in this paper. We propose taxonomies for watermarking schemes and removal attacks. Our empirical evaluation includes an ablation study over sets of parameters for each attack and watermarking scheme on the CIFAR-10 and ImageNet datasets. Surprisingly, none of the surveyed watermarking schemes is robust in practice. We find that schemes fail to withstand adaptive attacks and known methods for deriving surrogate models that have not been evaluated as removal attacks. This points to intrinsic flaws in how robustness is currently evaluated. We show that watermarking schemes need to be evaluated against a more extensive set of removal attacks with a more realistic adversary model. Our source code and a complete dataset of evaluation results are publicly available, which allows to independently verify our conclusions.

CRJul 26, 2021
Selective MPC: Distributed Computation of Differentially Private Key-Value Statistics

Thomas Humphries, Rasoul Akhavan Mahdavi, Shannon Veitch et al.

Key-value data is a naturally occurring data type that has not been thoroughly investigated in the local trust model. Existing local differentially private (LDP) solutions for computing statistics over key-value data suffer from the inherent accuracy limitations of each user adding their own noise. Multi-party computation (MPC) maintains better accuracy than LDP and similarly does not require a trusted central party. However, naively applying MPC to key-value data results in prohibitively expensive computation costs. In this work, we present selective multi-party computation, a novel approach to distributed computation that leverages DP leakage to efficiently and accurately compute statistics over key-value data. By providing each party with a view of a random subset of the data, we can capture subtractive noise. We prove that our protocol satisfies pure DP and is provably secure in the combined DP/MPC model. Our empirical evaluation demonstrates that we can compute statistics over 10,000 keys in 20 seconds and can scale up to 30 servers while obtaining results for a single key in under a second.

CRMar 10, 2021
Equi-Joins Over Encrypted Data for Series of Queries

Masoumeh Shafieinejad, Suraj Gupta, Jin Yang Liu et al.

Encryption provides a method to protect data outsourced to a DBMS provider, e.g., in the cloud. However, performing database operations over encrypted data requires specialized encryption schemes that carefully balance security and performance. In this paper, we present a new encryption scheme that can efficiently perform equi-joins over encrypted data with better security than the state-of-the-art. In particular, our encryption scheme reduces the leakage to equality of rows that match a selection criterion and only reveals the transitive closure of the sum of the leakages of each query in a series of queries. Our encryption scheme is provable secure. We implemented our encryption scheme and evaluated it over a dataset from the TPC-H benchmark.

DBMar 9, 2021
PCOR: Private Contextual Outlier Release via Differentially Private Search

Masoumeh Shafieinejad, Florian Kerschbaum, Ihab F. Ilyas

Outlier detection plays a significant role in various real world applications such as intrusion, malfunction, and fraud detection. Traditionally, outlier detection techniques are applied to find outliers in the context of the whole dataset. However, this practice neglects contextual outliers, that are not outliers in the whole dataset but in some specific neighborhoods. Contextual outliers are particularly important in data exploration and targeted anomaly explanation and diagnosis. In these scenarios, the data owner computes the following information: i) The attributes that contribute to the abnormality of an outlier (metric), ii) Contextual description of the outlier's neighborhoods (context), and iii) The utility score of the context, e.g. its strength in showing the outlier's significance, or in relation to a particular explanation for the outlier. However, revealing the outlier's context leaks information about the other individuals in the population as well, violating their privacy. We address the issue of population privacy violations in this paper, and propose a solution for the two main challenges. In this setting, the data owner is required to release a valid context for the queried record, i.e. a context in which the record is an outlier. Hence, the first major challenge is that the privacy technique must preserve the validity of the context for each record. We propose techniques to protect the privacy of individuals through a relaxed notion of differential privacy to satisfy this requirement. The second major challenge is applying the proposed techniques efficiently, as they impose intensive computation to the base algorithm. To overcome this challenge, we propose a graph structure to map the contexts to, and introduce differentially private graph search algorithms as efficient solutions for the computation problem caused by differential privacy techniques.

CRMar 4, 2021
Quantifying identifiability to choose and audit $ε$ in differentially private deep learning

Daniel Bernau, Günther Eibl, Philip W. Grassal et al.

Differential privacy allows bounding the influence that training data records have on a machine learning model. To use differential privacy in machine learning, data scientists must choose privacy parameters $(ε,δ)$. Choosing meaningful privacy parameters is key, since models trained with weak privacy parameters might result in excessive privacy leakage, while strong privacy parameters might overly degrade model utility. However, privacy parameter values are difficult to choose for two main reasons. First, the theoretical upper bound on privacy loss $(ε,δ)$ might be loose, depending on the chosen sensitivity and data distribution of practical datasets. Second, legal requirements and societal norms for anonymization often refer to individual identifiability, to which $(ε,δ)$ are only indirectly related. We transform $(ε,δ)$ to a bound on the Bayesian posterior belief of the adversary assumed by differential privacy concerning the presence of any record in the training dataset. The bound holds for multidimensional queries under composition, and we show that it can be tight in practice. Furthermore, we derive an identifiability bound, which relates the adversary assumed in differential privacy to previous work on membership inference adversaries. We formulate an implementation of this differential privacy adversary that allows data scientists to audit model training and compute empirical identifiability scores and empirical $(ε,δ)$.

CRFeb 18, 2021
Obfuscated Access and Search Patterns in Searchable Encryption

Zhiwei Shang, Simon Oya, Andreas Peter et al.

Searchable Symmetric Encryption (SSE) allows a data owner to securely outsource its encrypted data to a cloud server while maintaining the ability to search over it and retrieve matched documents. Most existing SSE schemes leak which documents are accessed per query, i.e., the so-called access pattern, and thus are vulnerable to attacks that can recover the database or the queried keywords. Current techniques that fully hide access patterns, such as ORAM or PIR, suffer from heavy communication or computational costs, and are not designed with search capabilities in mind. Recently, Chen et al. (INFOCOM'18) proposed an obfuscation framework for SSE that protects the access pattern in a differentially private way with a reasonable utility cost. However, this scheme leaks the so-called search pattern, i.e., how many times a certain query is performed. This leakage makes the proposal vulnerable to certain database and query recovery attacks. In this paper, we propose OSSE (Obfuscated SSE), an SSE scheme that obfuscates the access pattern independently for each query performed. This in turn hides the search pattern and makes our scheme resistant against attacks that rely on this leakage. Under certain reasonable assumptions, our scheme has smaller communication overhead than ORAM-based SSE. Furthermore, our scheme works in a single communication round and requires very small constant client-side storage. Our empirical evaluation shows that OSSE is highly effective at protecting against different query recovery attacks while keeping a reasonable utility level. Our protocol provides significantly more protection than the proposal by Chen et al.~against some state-of-the-art attacks, which demonstrates the importance of hiding search patterns in designing effective privacy-preserving SSE schemes.

CROct 23, 2020
Investigating Membership Inference Attacks under Data Dependencies

Thomas Humphries, Simon Oya, Lindsey Tulloch et al.

Training machine learning models on privacy-sensitive data has become a popular practice, driving innovation in ever-expanding fields. This has opened the door to new attacks that can have serious privacy implications. One such attack, the Membership Inference Attack (MIA), exposes whether or not a particular data point was used to train a model. A growing body of literature uses Differentially Private (DP) training algorithms as a defence against such attacks. However, these works evaluate the defence under the restrictive assumption that all members of the training set, as well as non-members, are independent and identically distributed. This assumption does not hold for many real-world use cases in the literature. Motivated by this, we evaluate membership inference with statistical dependencies among samples and explain why DP does not provide meaningful protection (the privacy parameter $ε$ scales with the training set size $n$) in this more general case. We conduct a series of empirical evaluations with off-the-shelf MIAs using training sets built from real-world data showing different types of dependencies among samples. Our results reveal that training set dependencies can severely increase the performance of MIAs, and therefore assuming that data samples are statistically independent can significantly underestimate the performance of MIAs.

CROct 7, 2020
Hiding the Access Pattern is Not Enough: Exploiting Search Pattern Leakage in Searchable Encryption

Simon Oya, Florian Kerschbaum

Recent Searchable Symmetric Encryption (SSE) schemes enable secure searching over an encrypted database stored in a server while limiting the information leaked to the server. These schemes focus on hiding the access pattern, which refers to the set of documents that match the client's queries. This provides protection against current attacks that largely depend on this leakage to succeed. However, most SSE constructions also leak whether or not two queries aim for the same keyword, also called the search pattern. In this work, we show that search pattern leakage can severely undermine current SSE defenses. We propose an attack that leverages both access and search pattern leakage, as well as some background and query distribution information, to recover the keywords of the queries performed by the client. Our attack follows a maximum likelihood estimation approach, and is easy to adapt against SSE defenses that obfuscate the access pattern. We empirically show that our attack is efficient, it outperforms other proposed attacks, and it completely thwarts two out of the three defenses we evaluate it against, even when these defenses are set to high privacy regimes. These findings highlight that hiding the search pattern, a feature that most constructions are lacking, is key towards providing practical privacy guarantees in SSE.

DBMar 20, 2020
Efficient Oblivious Database Joins

Simeon Krastnikov, Florian Kerschbaum, Douglas Stebila

A major algorithmic challenge in designing applications intended for secure remote execution is ensuring that they are oblivious to their inputs, in the sense that their memory access patterns do not leak sensitive information to the server. This problem is particularly relevant to cloud databases that wish to allow queries over the client's encrypted data. One of the major obstacles to such a goal is the join operator, which is non-trivial to implement obliviously without resorting to generic but inefficient solutions like Oblivious RAM (ORAM). We present an oblivious algorithm for equi-joins which (up to a logarithmic factor) matches the optimal $O(n\log n)$ complexity of the standard non-secure sort-merge join (on inputs producing $O(n)$ outputs). We do not use use expensive primitives like ORAM or rely on unrealistic hardware or security assumptions. Our approach, which is based on sorting networks and novel provably-oblivious constructions, is conceptually simple, easily verifiable, and very efficient in practice. Its data-independent algorithmic structure makes it secure in various different settings for remote computation, even in those that are known to be vulnerable to certain side-channel attacks (such as Intel SGX) or with strict requirements for low circuit complexity (like secure multiparty computation). We confirm that our approach is easily realizable through a compact implementation which matches our expectations for performance and is shown, both formally and empirically, to possess the desired security characteristics.

CRFeb 12, 2020
EncDBDB: Searchable Encrypted, Fast, Compressed, In-Memory Database using Enclaves

Benny Fuhry, Jayanth Jain H A, Florian Kerschbaum

Data confidentiality is an important requirement for clients when outsourcing databases to the cloud. Trusted execution environments, such as Intel SGX, offer an efficient, hardware-based solution to this cryptographic problem. Existing solutions are not optimized for column-oriented, in-memory databases and pose impractical memory requirements on the enclave. We present EncDBDB, a novel approach for client-controlled encryption of a column-oriented, in-memory databases allowing range searches using an enclave. EncDBDB offers nine encrypted dictionaries, which provide different security, performance and storage efficiency tradeoffs for the data. It is especially suited for complex, read-oriented, analytic queries, e.g., as present in data warehouses. The computational overhead compared to plaintext processing is within a millisecond even for databases with millions of entries and the leakage is limited. Compressed encrypted data requires less space than a corresponding plaintext column. Furthermore, the resulting code - and data - in the enclave is very small reducing the potential for security-relevant implementation errors and side-channel leakages.

CRDec 24, 2019
Assessing differentially private deep learning with Membership Inference

Daniel Bernau, Philip-William Grassal, Jonas Robl et al.

Attacks that aim to identify the training data of public neural networks represent a severe threat to the privacy of individuals participating in the training data set. A possible protection is offered by anonymization of the training data or training function with differential privacy. However, data scientists can choose between local and central differential privacy and need to select meaningful privacy parameters $ε$ which is challenging for non-privacy experts. We empirically compare local and central differential privacy mechanisms under white- and black-box membership inference to evaluate their relative privacy-accuracy trade-offs. We experiment with several datasets and show that this trade-off is similar for both types of mechanisms. This suggests that local differential privacy is a sound alternative to central differential privacy for differentially private deep learning, since small $ε$ in central differential privacy and large $ε$ in local differential privacy result in similar membership inference attack risk.

LGDec 2, 2019
Deep Neural Network Fingerprinting by Conferrable Adversarial Examples

Nils Lukas, Yuxuan Zhang, Florian Kerschbaum

In Machine Learning as a Service, a provider trains a deep neural network and gives many users access. The hosted (source) model is susceptible to model stealing attacks, where an adversary derives a surrogate model from API access to the source model. For post hoc detection of such attacks, the provider needs a robust method to determine whether a suspect model is a surrogate of their model. We propose a fingerprinting method for deep neural network classifiers that extracts a set of inputs from the source model so that only surrogates agree with the source model on the classification of such inputs. These inputs are a subclass of transferable adversarial examples which we call conferrable adversarial examples that exclusively transfer with a target label from a source model to its surrogates. We propose a new method to generate these conferrable adversarial examples. We present an extensive study on the irremovability of our fingerprint against fine-tuning, weight pruning, retraining, retraining with different architectures, three model extraction attacks from related work, transfer learning, adversarial training, and two new adaptive attacks. Our fingerprint is robust against distillation, related model extraction attacks, and even transfer learning when the attacker has no access to the model provider's dataset. Our fingerprint is the first method that reaches a ROC AUC of 1.0 in verifying surrogates, compared to a ROC AUC of 0.63 by previous fingerprints.

CROct 31, 2019
RIGA: Covert and Robust White-Box Watermarking of Deep Neural Networks

Tianhao Wang, Florian Kerschbaum

Watermarking of deep neural networks (DNN) can enable their tracing once released by a data owner. In this paper, we generalize white-box watermarking algorithms for DNNs, where the data owner needs white-box access to the model to extract the watermark. White-box watermarking algorithms have the advantage that they do not impact the accuracy of the watermarked model. We propose Robust whIte-box GAn watermarking (RIGA), a novel white-box watermarking algorithm that uses adversarial training. Our extensive experiments demonstrate that the proposed watermarking algorithm not only does not impact accuracy, but also significantly improves the covertness and robustness over the current state-of-art.

CRSep 18, 2019
Non-Interactive Private Decision Tree Evaluation

Anselme Tueno, Yordan Boev, Florian Kerschbaum

Decision trees are a powerful prediction model with many applications in statistics, data mining, and machine learning. In some settings, the model and the data to be classified may contain sensitive information belonging to different parties. In this paper, we, therefore, address the problem of privately evaluating a decision tree on private data. This scenario consists of a server holding a private decision tree model and a client interested in classifying its private attribute vector using the server's private model. The goal of the computation is to obtain the classification while preserving the privacy of both - the decision tree and the client input. After the computation, the classification result is revealed only to the client, and nothing else is revealed neither to the client nor to the server. Existing privacy-preserving protocols that address this problem use or combine different generic secure multiparty computation approaches resulting in several interactions between the client and the server. Our goal is to design and implement a novel client-server protocol that delegates the complete tree evaluation to the server while preserving privacy and reducing the overhead. The idea is to use fully (somewhat) homomorphic encryption and evaluate the tree on ciphertexts encrypted under the client's public key. However, since current somewhat homomorphic encryption schemes have high overhead, we combine efficient data representations with different algorithmic optimizations to keep the computational overhead and the communication cost low. As a result, we are able to provide the first non-interactive protocol, that allows the client to delegate the evaluation to the server by sending an encrypted input and receiving only the encryption of the result. Our scheme has only one round and can evaluate a complete tree of depth 10 within seconds.

CRSep 18, 2019
Secure Computation of the kth-Ranked Element in a Star Network

Anselme Tueno, Florian Kerschbaum, Stefan Katzenbeisser et al.

We consider the problem of securely computing the kth-ranked element in a sequence of n private integers distributed among n parties. The kth-ranked element (e.g., minimum, maximum, median) is of particular interest in benchmarking, which allows a company to compare its own key performance indicator to the statistics of its peer group. The individual integers are sensitive data, yet the kth-ranked element is of mutual interest to the parties. Previous secure computation protocols for the kth-ranked element require a communication channel between each pair of parties. They do not scale to a large number of parties as they are highly interactive resulting in longer delays. Moreover, they are difficult to deploy as special arrangements are required between each pair of parties to establish a secure connection. A server model naturally fits with the client-server architecture of Internet applications in which clients are connected to the server and not to other clients. It can simplify secure computation by reducing the number of rounds, and as a result, improve its performance and scalability. In this model, there are communication channels only between each client and the server, while only clients provide inputs to the computation. Hence, it is a centralized communication pattern, i.e., a star network. We propose different approaches for privately computing the kth-ranked element in the server model, using either garbled circuits or threshold homomorphic encryption. Our schemes have a constant number of rounds and can compute the kth-ranked element within seconds for up to 50 clients in a WAN.

LGJun 18, 2019
On the Robustness of the Backdoor-based Watermarking in Deep Neural Networks

Masoumeh Shafieinejad, Jiaqi Wang, Nils Lukas et al.

Obtaining the state of the art performance of deep learning models imposes a high cost to model generators, due to the tedious data preparation and the substantial processing requirements. To protect the model from unauthorized re-distribution, watermarking approaches have been introduced in the past couple of years. We investigate the robustness and reliability of state-of-the-art deep neural network watermarking schemes. We focus on backdoor-based watermarking and propose two -- a black-box and a white-box -- attacks that remove the watermark. Our black-box attack steals the model and removes the watermark with minimum requirements; it just relies on public unlabeled data and a black-box access to the classification label. It does not need classification confidences or access to the model's sensitive information such as the training data set, the trigger set or the model parameters. The white-box attack, proposes an efficient watermark removal when the parameters of the marked model are available; our white-box attack does not require access to the labeled data or the trigger set and improves the runtime of the black-box attack up to seventeen times. We as well prove the security inadequacy of the backdoor-based watermarking in keeping the watermark undetectable by proposing an attack that detects whether a model contains a watermark. Our attacks show that a recipient of a marked model can remove a backdoor-based watermark with significantly less effort than training a new model and some other techniques are needed to protect against re-distribution by a motivated attacker.

CRMay 2, 2018
SynTF: Synthetic and Differentially Private Term Frequency Vectors for Privacy-Preserving Text Mining

Benjamin Weggenmann, Florian Kerschbaum

Text mining and information retrieval techniques have been developed to assist us with analyzing, organizing and retrieving documents with the help of computers. In many cases, it is desirable that the authors of such documents remain anonymous: Search logs can reveal sensitive details about a user, critical articles or messages about a company or government might have severe or fatal consequences for a critic, and negative feedback in customer surveys might negatively impact business relations if they are identified. Simply removing personally identifying information from a document is, however, insufficient to protect the writer's identity: Given some reference texts of suspect authors, so-called authorship attribution methods can reidentfy the author from the text itself. One of the most prominent models to represent documents in many common text mining and information retrieval tasks is the vector space model where each document is represented as a vector, typically containing its term frequencies or related quantities. We therefore propose an automated text anonymization approach that produces synthetic term frequency vectors for the input documents that can be used in lieu of the original vectors. We evaluate our method on an exemplary text classification task and demonstrate that it only has a low impact on its accuracy. In contrast, we show that our method strongly affects authorship attribution techniques to the level that they become infeasible with a much stronger decline in accuracy. Other than previous authorship obfuscation methods, our approach is the first that fulfills differential privacy and hence comes with a provable plausible deniability guarantee.

CRFeb 4, 2018
Secure Range Queries for Multiple Users

Anselme Tueno, Florian Kerschbaum

Order-preserving encryption allows encrypting data, while still enabling efficient range queries on the encrypted data. Moreover, it does not require any change to the database management system, because comparison operates on ciphertexts as on plaintexts. This makes order-preserving encryption schemes very suitable for data outsourcing in cloud computing scenarios. However, all order-preserving encryption schemes are necessarily symmetric limiting the use case to one client and one server. Imagine a scenario where a Data Owner encrypts its data before outsourcing it to the Cloud Service Provider and a Data Analyst wants to execute private range queries on this data. This scenario occurs in many cases of collaborative machine learning where data source and processor are different entities. Then either the Data Owner must reveal its encryption key or the Data Analyst must reveal the private queries. In this paper, we overcome this limitation by allowing the equivalent of a public-key order-preserving encryption. We present a secure multiparty protocol that enables secure range queries for multiple users. In this scheme, the Data Analyst cooperates with the Data Owner and the Cloud Service Provider in order to order-preserving encrypt the private range queries without revealing any other information to the parties. We implemented our scheme and observed that if the database size of the Data Owner has 1 million entries it takes only about 0.3 s on average via a loopback interface (1.3 s via a LAN) to encrypt an input of the Data Analyst.

CROct 1, 2017
Computation on Encrypted Data using Data Flow Authentication

Andreas Fischer, Benny Fuhry, Florian Kerschbaum et al.

Encrypting data before sending it to the cloud protects it against hackers and malicious insiders, but requires the cloud to compute on encrypted data. Trusted (hardware) modules, e.g., secure enclaves like Intel's SGX, can very efficiently run entire programs in encrypted memory. However, it already has been demonstrated that software vulnerabilities give an attacker ample opportunity to insert arbitrary code into the program. This code can then modify the data flow of the program and leak any secret in the program to an observer in the cloud via SGX side-channels. Since any larger program is rife with software vulnerabilities, it is not a good idea to outsource entire programs to an SGX enclave. A secure alternative with a small trusted code base would be fully homomorphic encryption (FHE) -- the holy grail of encrypted computation. However, due to its high computational complexity it is unlikely to be adopted in the near future. As a result researchers have made several proposals for transforming programs to perform encrypted computations on less powerful encryption schemes. Yet, current approaches fail on programs that make control-flow decisions based on encrypted data. In this paper, we introduce the concept of data flow authentication (DFAuth). DFAuth prevents an adversary from arbitrarily deviating from the data flow of a program. Hence, an attacker cannot perform an attack as outlined before on SGX. This enables that all programs, even those including operations on control-flow decision variables, can be computed on encrypted data. We implemented DFAuth using a novel authenticated homomorphic encryption scheme, a Java bytecode-to-bytecode compiler producing fully executable programs, and SGX enclaves. A transformed neural network that performs machine learning on sensitive medical data can be evaluated on encrypted inputs and encrypted weights in 0.86 seconds.

CRSep 27, 2017
An Efficiently Searchable Encrypted Data Structure for Range Queries

Florian Kerschbaum, Anselme Tueno

At CCS 2015 Naveed et al. presented first attacks on efficiently searchable encryption, such as deterministic and order-preserving encryption. These plaintext guessing attacks have been further improved in subsequent work, e.g. by Grubbs et al. in 2016. Such cryptanalysis is crucially important to sharpen our understanding of the implications of security models. In this paper we present an efficiently searchable, encrypted data structure that is provably secure against these and even more powerful chosen plaintext attacks. Our data structure supports logarithmic-time search with linear space complexity. The indices of our data structure can be used to search by standard comparisons and hence allow easy retrofitting to existing database management systems. We implemented our scheme and show that its search time overhead is only 10 milliseconds compared to non-secure search.

CRMar 14, 2017
HardIDX: Practical and Secure Index with SGX

Benny Fuhry, Raad Bahmani, Ferdinand Brasser et al.

Software-based approaches for search over encrypted data are still either challenged by lack of proper, low-leakage encryption or slow performance. Existing hardware-based approaches do not scale well due to hardware limitations and software designs that are not specifically tailored to the hardware architecture, and are rarely well analyzed for their security (e.g., the impact of side channels). Additionally, existing hardware-based solutions often have a large code footprint in the trusted environment susceptible to software compromises. In this paper we present HardIDX: a hardware-based approach, leveraging Intel's SGX, for search over encrypted data. It implements only the security critical core, i.e., the search functionality, in the trusted environment and resorts to untrusted software for the remainder. HardIDX is deployable as a highly performant encrypted database index: it is logarithmic in the size of the index and searches are performed within a few milliseconds rather than seconds. We formally model and prove the security of our scheme showing that its leakage is equivalent to the best known searchable encryption schemes. Our implementation has a very small code and memory footprint yet still scales to virtually unlimited search index sizes, i.e., size is limited only by the general - non-secure - hardware resources.

CRMay 1, 2014
Inference Control for Privacy-Preserving Genome Matching

Florian Kerschbaum, Martin Beck, Dagmar Schönfeld

Privacy is of the utmost importance in genomic matching. Therefore a number of privacy-preserving protocols have been presented using secure computation. Nevertheless, none of these protocols prevents inferences from the result. Goodrich has shown that this resulting information is sufficient for an effective attack on genome databases. In this paper we present an approach that can detect and mitigate such an attack on encrypted messages while still preserving the privacy of both parties. Note that randomization, e.g.~using differential privacy, will almost certainly destroy the utility of the matching result. We combine two known cryptographic primitives -- secure computation of the edit distance and fuzzy commitments -- in order to prevent submission of similar genome sequences. Particularly, we contribute an efficient zero-knowledge proof that the same input has been used in both primitives. We show that using our approach it is feasible to preserve privacy in genome matching and also detect and mitigate Goodrich's attack.

CRSep 24, 2012
Approximate Two-Party Privacy-Preserving String Matching with Linear Complexity

Martin Beck, Florian Kerschbaum

Consider two parties who want to compare their strings, e.g., genomes, but do not want to reveal them to each other. We present a system for privacy-preserving matching of strings, which differs from existing systems by providing a deterministic approximation instead of an exact distance. It is efficient (linear complexity), non-interactive and does not involve a third party which makes it particularly suitable for cloud computing. We extend our protocol, such that it mitigates iterated differential attacks proposed by Goodrich. Further an implementation of the system is evaluated and compared against current privacy-preserving string matching algorithms.