David Lie

CR
h-index: 15
23 papers
1,868 citations
Novelty: 56%
AI Score: 52

23 Papers

LG · Sep 22, 2022
In Differential Privacy, There is Truth: On Vote Leakage in Ensemble Private Learning

Jiaqi Wang, Roei Schuster, Ilia Shumailov et al. · deepmind

When learning from sensitive data, care must be taken to ensure that training algorithms address privacy concerns. The canonical Private Aggregation of Teacher Ensembles, or PATE, computes output labels by aggregating the predictions of a (possibly distributed) collection of teacher models via a voting mechanism. The mechanism adds noise to attain a differential privacy guarantee with respect to the teachers' training data. In this work, we observe that this use of noise, which makes PATE predictions stochastic, enables new forms of leakage of sensitive information. For a given input, our adversary exploits this stochasticity to extract high-fidelity histograms of the votes submitted by the underlying teachers. From these histograms, the adversary can learn sensitive attributes of the input such as race, gender, or age. Although this attack does not directly violate the differential privacy guarantee, it clearly violates privacy norms and expectations, and would not be possible at all without the noise inserted to obtain differential privacy. In fact, counter-intuitively, the attack becomes easier as we add more noise to provide stronger differential privacy. We hope this encourages future work to consider privacy holistically rather than treat differential privacy as a panacea.
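To illustrate why the noise makes predictions stochastic, the sketch below (not the authors' attack) aggregates a hypothetical vote histogram with Gaussian noise and queries the same input repeatedly. The function names, vote counts, and noise scale are assumptions chosen for illustration; the point is only that the frequencies of the returned labels depend on the hidden votes.

```python
import random
from collections import Counter

def noisy_argmax(votes, sigma, rng):
    """Return the label with the highest vote count after adding Gaussian noise."""
    noisy = [v + rng.gauss(0.0, sigma) for v in votes]
    return max(range(len(votes)), key=lambda i: noisy[i])

def label_frequencies(votes, sigma, queries, seed=0):
    """Query the same input repeatedly and record how often each label is returned."""
    rng = random.Random(seed)
    counts = Counter(noisy_argmax(votes, sigma, rng) for _ in range(queries))
    return [counts[i] / queries for i in range(len(votes))]

# Hypothetical teacher votes for one input (3 classes, 100 teachers).
hidden_votes = [60, 30, 10]
freqs = label_frequencies(hidden_votes, sigma=40.0, queries=5000)
print(freqs)  # label frequencies observed by a party that never sees hidden_votes
```

With no noise the aggregator would always return the same label, so repeated queries would reveal nothing beyond the plurality class.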

HC · Mar 1, 2023
Implementing Active Learning in Cybersecurity: Detecting Anomalies in Redacted Emails

Mu-Huan Chung, Lu Wang, Sharon Li et al.

Research on email anomaly detection has typically relied on specially prepared datasets that may not adequately reflect the type of data that occurs in industry settings. In our research, at a major financial services company, privacy concerns prevented inspection of the bodies of emails and attachment details (although subject headings and attachment filenames were available). This made labeling possible anomalies in the resulting redacted emails more difficult. Another source of difficulty is the high volume of emails combined with the scarcity of resources making machine learning (ML) a necessity, but also creating a need for more efficient human training of ML models. Active learning (AL) has been proposed as a way to make human training of ML models more efficient. However, the implementation of Active Learning methods is a human-centered AI challenge due to potential human analyst uncertainty, and the labeling task can be further complicated in domains such as the cybersecurity domain (or healthcare, aviation, etc.) where mistakes in labeling can have highly adverse consequences. In this paper we present research results concerning the application of Active Learning to anomaly detection in redacted emails, comparing the utility of different methods for implementing active learning in this context. We evaluate different AL strategies and their impact on resulting model performance. We also examine how ratings of confidence that experts have in their labels can inform AL. The results obtained are discussed in terms of their implications for AL methodology and for the role of experts in model-assisted email anomaly screening.

LG · Dec 19, 2024 · Code
Time Will Tell: Timing Side Channels via Output Token Count in Large Language Models

Tianchen Zhang, Gururaj Saileshwar, David Lie

This paper demonstrates a new side-channel that enables an adversary to extract sensitive information about inference inputs in large language models (LLMs) based on the number of output tokens in the LLM response. We construct attacks using this side-channel in two common LLM tasks: recovering the target language in machine translation tasks and recovering the output class in classification tasks. In addition, due to the auto-regressive generation mechanism in LLMs, an adversary can recover the output token count reliably using a timing channel, even over the network against a popular closed-source commercial LLM. Our experiments show that an adversary can learn the output language in translation tasks with more than 75% precision across three different models (Tower, M2M100, MBart50). Using this side-channel, we also show the input class in text classification tasks can be leaked out with more than 70% precision from open-source LLMs like Llama-3.1, Llama-3.2, Gemma2, and production models like GPT-4o. Finally, we propose tokenizer-, system-, and prompt-based mitigations against the output token count side-channel.
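A minimal sketch of the timing idea, assuming each output token takes a roughly constant decode time after the first token arrives. The function names, latency values, and per-class token counts below are hypothetical and are not taken from the paper.

```python
def estimate_token_count(total_seconds, first_token_seconds, seconds_per_token):
    """Estimate the number of output tokens from the response time.

    Assumes auto-regressive decoding with a constant per-token latency.
    """
    if seconds_per_token <= 0:
        raise ValueError("seconds_per_token must be positive")
    decode_seconds = max(0.0, total_seconds - first_token_seconds)
    return 1 + round(decode_seconds / seconds_per_token)

def guess_class(token_count, expected_counts):
    """Pick the class whose typical output length is closest to the estimate."""
    return min(expected_counts, key=lambda label: abs(expected_counts[label] - token_count))

# Hypothetical profile of typical output lengths per class.
expected_counts = {"positive": 12, "negative": 30}
observed = estimate_token_count(total_seconds=0.8, first_token_seconds=0.2,
                                seconds_per_token=0.02)
guess = guess_class(observed, expected_counts)
print(observed, guess)
```

In practice network jitter and batching would add noise to the timing, so a real measurement would likely need repeated samples.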

CR · Nov 2, 2017 · Code
BinPro: A Tool for Binary Source Code Provenance

Dhaval Miyani, Zhen Huang, David Lie

Enforcing open source licenses such as the GNU General Public License (GPL), analyzing a binary for possible vulnerabilities, and code maintenance are all situations where it is useful to be able to determine the source code provenance of a binary. While previous work has either focused on computing binary-to-binary similarity or source-to-source similarity, BinPro is the first work we are aware of to tackle the problem of source-to-binary similarity. BinPro can match binaries with their source code even without knowing which compiler was used to produce the binary, or what optimization level was used with the compiler. To do this, BinPro utilizes machine learning to compute optimal code features for determining binary-to-source similarity and a static analysis pipeline to extract and compute similarity based on those features. Our experiments show that on average BinPro computes a similarity of 81% for matching binaries and source code of the same applications, and an average similarity of 25% for binaries and source code of similar but different applications. This shows that BinPro's similarity score is useful for determining if a binary was derived from a particular source code.

CR · May 5
GPUBreach: Privilege Escalation Attacks on GPUs using Rowhammer

Chris S. Lin, Yuqin Yan, Guozhen Ding et al.

NVIDIA GPUs with GDDR memories have been shown to be susceptible to Rowhammer-based bit-flips, similar to CPUs. However, Rowhammer exploits on GPUs have been limited to injecting untargeted bit-flips in victim data like weights of machine learning models, to degrade model accuracy, unlike CPU exploits shown capable of privilege escalation. In this paper, we demonstrate that GPU Rowhammer exploits can be as potent as CPU Rowhammer attacks. By exploiting the GPU page table management to identify when and where new page tables are allocated, we enable an unprivileged user CUDA kernel of one process to use Rowhammer bit-flips to gain access to the GPU memory of other processes or co-tenants via targeted tampering of such page tables resident in GPU memory. Using this newly found primitive, we demonstrate the first GPU-side privilege escalation attacks, leaking secret data such as cryptographic keys from cuPQC libraries, and even tampering with the model's GPU assembly code to degrade models more stealthily than previous attacks. We further demonstrate that GPU-side privilege escalation can lead to CPU-side privilege escalation, defeating the protections provided by the IOMMU, enabling a malicious user-level program with GPU access to gain a root shell and system-wide control, even in a non-multi-tenant setting.

CR · Aug 28, 2024
ANVIL: Anomaly-based Vulnerability Identification without Labelled Training Data

Weizhou Wang, Eric Liu, Xiangyu Guo et al.

Supervised-learning-based vulnerability detectors often fall short due to limited labelled training data. In contrast, Large Language Models (LLMs) like GPT-4 are trained on vast unlabelled code corpora, yet perform only marginally better than coin flips when directly prompted to detect vulnerabilities. In this paper, we reframe vulnerability detection as anomaly detection, based on the premise that vulnerable code is rare and thus anomalous relative to patterns learned by LLMs. We introduce ANVIL, which performs a masked code reconstruction task: the LLM reconstructs a masked line of code, and deviations from the original are scored as anomalies. We propose a hybrid anomaly score that combines exact match, cross-entropy loss, prediction confidence, and structural complexity. We evaluate our approach across multiple LLM families, scoring methods, and context sizes, and against vulnerabilities after the LLM's training cut-off. On the PrimeVul dataset, ANVIL outperforms state-of-the-art supervised detectors (LineVul, LineVD, and LLMAO), achieving up to 2x higher Top-3 accuracy, 75% better Normalized MFR, and a significant improvement on ROC-AUC. Finally, by integrating ANVIL with fuzzers, we uncover two previously unknown vulnerabilities, demonstrating the practical utility of anomaly-guided detection.

SD · Nov 26, 2025
HarmonicAttack: An Adaptive Cross-Domain Audio Watermark Removal

Kexin Li, Xiao Hu, Ilya Grishchenko et al.

The availability of high-quality, AI-generated audio raises security challenges such as misinformation campaigns and voice-cloning fraud. A key defense against the misuse of AI-generated audio is watermarking it, so that it can be easily distinguished from genuine audio. As those seeking to misuse AI-generated audio may thus seek to remove audio watermarks, studying effective watermark removal techniques is critical to being able to objectively evaluate the robustness of audio watermarks against removal. Previous watermark removal schemes either assume impractical knowledge of the watermarks they are designed to remove or are computationally expensive, potentially generating a false sense of confidence in current watermark schemes. We introduce HarmonicAttack, an efficient audio watermark removal method that only requires the basic ability to generate the watermarks from the targeted scheme and nothing else. With this, we are able to train a general watermark removal model that is able to remove the watermarks generated by the targeted scheme from any watermarked audio sample. HarmonicAttack employs a dual-path convolutional autoencoder that operates in both temporal and frequency domains, along with GAN-style training, to separate the watermark from the original audio. When evaluated against state-of-the-art watermark schemes AudioSeal, WavMark, and Silentcipher, HarmonicAttack demonstrates greater watermark removal ability than previous watermark removal methods with near real-time performance. Moreover, while HarmonicAttack requires training, we find that it is able to transfer to out-of-distribution samples with minimal degradation in performance.

CL · Feb 5, 2025
MARAGE: Transferable Multi-Model Adversarial Attack for Retrieval-Augmented Generation Data Extraction

Xiao Hu, Eric Liu, Weizhou Wang et al.

Retrieval-Augmented Generation (RAG) offers a solution to mitigate hallucinations in Large Language Models (LLMs) by grounding their outputs to knowledge retrieved from external sources. The use of private resources and data in constructing these external data stores can expose them to risks of extraction attacks, in which attackers attempt to steal data from these private databases. Existing RAG extraction attacks often rely on manually crafted prompts, which limit their effectiveness. In this paper, we introduce a framework called MARAGE for optimizing an adversarial string that, when appended to user queries submitted to a target RAG system, causes outputs containing the retrieved RAG data verbatim. MARAGE leverages a continuous optimization scheme that integrates gradients from multiple models with different architectures simultaneously to enhance the transferability of the optimized string to unseen models. Additionally, we propose a strategy that emphasizes the initial tokens in the target RAG data, further improving the attack's generalizability. Evaluations show that MARAGE consistently outperforms both manual and optimization-based baselines across multiple LLMs and RAG datasets, while maintaining robust transferability to previously unseen models. Moreover, we conduct probing tasks to shed light on the reasons why MARAGE is more effective compared to the baselines and to analyze the impact of our approach on the model's internal state.

CR · Nov 26, 2025
HMARK: Radioactive Multi-Bit Semantic-Latent Watermarking for Diffusion Models

Kexin Li, Guozhen Ding, Ilya Grishchenko et al.

Modern generative diffusion models rely on vast training datasets, often including images with uncertain ownership or usage rights. Radioactive watermarks -- marks that transfer to a model's outputs -- can help detect when such unauthorized data has been used for training. Moreover, aside from being radioactive, an effective watermark for protecting images from unauthorized training also needs to meet other existing requirements, such as imperceptibility, robustness, and multi-bit capacity. To overcome these challenges, we propose HMARK, a novel multi-bit watermarking scheme, which encodes ownership information as secret bits in the semantic-latent space (h-space) for image diffusion models. By leveraging the interpretability and semantic significance of h-space, ensuring that watermark signals correspond to meaningful semantic attributes, the watermarks embedded by HMARK exhibit radioactivity, robustness to distortions, and minimal impact on perceptual quality. Experimental results demonstrate that HMARK achieves 98.57% watermark detection accuracy, 95.07% bit-level recovery accuracy, 100% recall rate, and 1.0 AUC on images produced by the downstream adversarial model finetuned with LoRA on watermarked data across various types of distortions.

HC · May 13, 2024
Maximizing Information Gain in Privacy-Aware Active Learning of Email Anomalies

Mu-Huan Miles Chung, Sharon Li, Jaturong Kongmanee et al.

Redacted emails satisfy most privacy requirements, but they make it more difficult to detect anomalous emails that may be indicative of data exfiltration. In this paper we develop an enhanced method of Active Learning using an information gain maximizing heuristic, and we evaluate its effectiveness in a real-world setting where only redacted versions of email could be labeled by human analysts due to privacy concerns. In the first case study we examined how Active Learning should be carried out. We found that model performance was best when a single highly skilled (in terms of the labelling task) analyst provided the labels. In the second case study we used confidence ratings to estimate the labeling uncertainty of analysts and then prioritized instances for labeling based on the expected information gain (the difference between model uncertainty and analyst uncertainty) that would be provided by labelling each instance. We found that the information gain maximization heuristic improved model performance over existing sampling methods for Active Learning. Based on the results obtained, we recommend that analysts should be screened, and possibly trained, prior to implementation of Active Learning in cybersecurity applications. We also recommend that the information gain maximizing sampling method (based on expert confidence) should be used in early stages of Active Learning, provided that well-calibrated confidence ratings can be obtained. We also note that the expertise of analysts should be assessed prior to Active Learning, as we found that analysts with lower labelling skill had poorly calibrated (over-) confidence in their labels.
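The prioritization rule described in the abstract (model uncertainty minus analyst uncertainty) can be sketched as follows. This is an illustrative reading, not the paper's implementation: the field names are hypothetical, and treating analyst uncertainty as one minus an expected confidence rating in [0, 1] is an assumption.

```python
def expected_gain(candidate):
    """Model uncertainty minus analyst uncertainty for one unlabeled email."""
    # Assumption: analyst uncertainty is 1 - expected confidence (both in [0, 1]).
    analyst_uncertainty = 1.0 - candidate["expected_analyst_confidence"]
    return candidate["model_uncertainty"] - analyst_uncertainty

def rank_for_labeling(candidates):
    """Order candidates so the highest expected information gain comes first."""
    return sorted(candidates, key=expected_gain, reverse=True)

# Hypothetical candidates: the model is unsure about A and B, but analysts
# are expected to label only A with high confidence.
candidates = [
    {"id": "C", "model_uncertainty": 0.2, "expected_analyst_confidence": 0.9},
    {"id": "B", "model_uncertainty": 0.9, "expected_analyst_confidence": 0.3},
    {"id": "A", "model_uncertainty": 0.9, "expected_analyst_confidence": 0.9},
]
order = [c["id"] for c in rank_for_labeling(candidates)]
print(order)
```

Plain uncertainty sampling would rank A and B equally; the difference term is what demotes instances an analyst is unlikely to label reliably.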

CL · Jan 16, 2024
Calpric: Inclusive and Fine-grain Labeling of Privacy Policies with Crowdsourcing and Active Learning

Wenjun Qiu, David Lie, Lisa Austin

A significant challenge to training accurate deep learning models on privacy policies is the cost and difficulty of obtaining a large and comprehensive set of training data. To address these challenges, we present Calpric, which combines automatic text selection and segmentation, active learning and the use of crowdsourced annotators to generate a large, balanced training set for privacy policies at low cost. Automated text selection and segmentation simplifies the labeling task, enabling untrained annotators from crowdsourcing platforms, like Amazon's Mechanical Turk, to be competitive with trained annotators, such as law students, and also reduces inter-annotator agreement, which decreases labeling cost. Having reliable labels for training enables the use of active learning, which uses fewer training samples to efficiently cover the input space, further reducing cost and improving class and data category balance in the data set. The combination of these techniques allows Calpric to produce models that are accurate over a wider range of data categories, and provide more detailed, fine-grain labels than previous work. Our crowdsourcing process enables Calpric to attain reliable labeled data at a cost of roughly $0.92-$1.71 per labeled text segment. Calpric's training process also generates a labeled data set of 16K privacy policy text segments across 9 data categories with balanced positive and negative samples.

SD · Aug 3, 2021
On the Exploitability of Audio Machine Learning Pipelines to Surreptitious Adversarial Examples

Adelin Travers, Lorna Licollari, Guanghan Wang et al.

Machine learning (ML) models are known to be vulnerable to adversarial examples. Applications of ML to voice biometrics authentication are no exception. Yet, the implications of audio adversarial examples on these real-world systems remain poorly understood given that most research targets limited defenders who can only listen to the audio samples. Conflating detectability of an attack with human perceptibility, research has focused on methods that aim to produce imperceptible adversarial examples which humans cannot distinguish from the corresponding benign samples. We argue that this perspective is coarse for two reasons: (1) imperceptibility is impossible to verify, since it would require an experimental process that encompasses variations in listener training, equipment, volume, ear sensitivity, types of background noise, etc.; and (2) it disregards pipeline-based detection clues that realistic defenders leverage. This results in adversarial examples that are ineffective in the presence of knowledgeable defenders. Thus, an adversary only needs an audio sample to be plausible to a human. We thus introduce surreptitious adversarial examples, a new class of attacks that evades both human and pipeline controls. In the white-box setting, we instantiate this class with a joint, multi-stage optimization attack. Using an Amazon Mechanical Turk user study, we show that this attack produces audio samples that are more surreptitious than previous attacks that aim solely for imperceptibility. Lastly, we show that surreptitious adversarial examples are challenging to develop in the black-box setting.

CR · Aug 7, 2020
Deep Active Learning with Crowdsourcing Data for Privacy Policy Classification

Wenjun Qiu, David Lie

Privacy policies are statements that notify users of the services' data practices. However, few users are willing to read through policy texts due to their length and complexity. While automated tools based on machine learning exist for privacy policy analysis, to achieve high classification accuracy, classifiers need to be trained on a large labeled dataset. Most existing policy corpora are labeled by skilled human annotators, requiring a significant amount of labor hours and effort. In this paper, we leverage active learning and crowdsourcing techniques to develop an automated classification tool named Calpric (Crowdsourcing Active Learning PRIvacy Policy Classifier), which is able to perform annotations equivalent to those done by skilled human annotators with high accuracy while minimizing the labeling cost. Specifically, active learning allows classifiers to proactively select the most informative segments to be labeled. On average, our model is able to achieve the same F1 score using only 62% of the original labeling effort. Calpric's use of active learning also addresses naturally occurring class imbalance in unlabeled privacy policy datasets as there are many more statements stating the collection of private information than stating the absence of collection. By selecting samples from the minority class for labeling, Calpric automatically creates a more balanced training set.

CR · Jul 31, 2020
vWitness: Certifying Web Page Interactions with Computer Vision

He Shuang, Lianying Zhao, David Lie

Web servers service client requests, some of which might cause the web server to perform security-sensitive operations (e.g. money transfer, voting). An attacker may thus forge or maliciously manipulate such requests by compromising a web client. Unfortunately, a web server has no way of knowing whether the client from which it receives a request has been compromised or not -- current "best practice" defenses such as user authentication or network encryption cannot aid a server as they all assume web client integrity. To address this shortcoming, we propose vWitness, which "witnesses" the interactions of a user with a web page and certifies whether they match a specification provided by the web server, enabling the web server to know that the web request is user-intended. The main challenge that vWitness overcomes is that even benign clients introduce unpredictable variations in the way they render web pages. vWitness differentiates between these benign variations and malicious manipulation using computer vision, allowing it to certify to the web server that 1) the web page user interface is properly displayed 2) observed user interactions are used to construct the web request. Our vWitness prototype achieves compatibility with modern web pages, is resilient to adversarial example attacks and is accurate and performant -- vWitness achieves 99.97% accuracy and adds 197ms of overhead to the entire interaction session in the average case.

CR · Dec 9, 2019
Machine Unlearning

Lucas Bourtoule, Varun Chandrasekaran, Christopher A. Choquette-Choo et al.

Once users have shared their data online, it is generally difficult for them to revoke access and ask for the data to be deleted. Machine learning (ML) exacerbates this problem because any model trained with said data may have memorized it, putting users at risk of a successful privacy attack exposing their information. Yet, having models unlearn is notoriously difficult. We introduce SISA training, a framework that expedites the unlearning process by strategically limiting the influence of a data point in the training procedure. While our framework is applicable to any learning algorithm, it is designed to achieve the largest improvements for stateful algorithms like stochastic gradient descent for deep neural networks. SISA training reduces the computational overhead associated with unlearning, even in the worst-case setting where unlearning requests are made uniformly across the training set. In some cases, the service provider may have a prior on the distribution of unlearning requests that will be issued by users. We may take this prior into account to partition and order data accordingly, and further decrease overhead from unlearning. Our evaluation spans several datasets from different domains, with corresponding motivations for unlearning. Under no distributional assumptions, for simple learning tasks, we observe that SISA training improves time to unlearn points from the Purchase dataset by 4.63x, and 2.45x for the SVHN dataset, over retraining from scratch. SISA training also provides a speed-up of 1.36x in retraining for complex learning tasks such as ImageNet classification; aided by transfer learning, this results in a small degradation in accuracy. Our work contributes to practical data governance in machine unlearning.
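The core idea of limiting a data point's influence can be sketched with sharding alone: each shard trains its own model, so removing a point only requires retraining the shard that held it. The class below is a toy illustration under loud assumptions: the per-shard "model" is just the majority label, slicing and checkpointing are omitted, and all names are hypothetical rather than taken from the paper's code.

```python
from collections import Counter

class ShardedEnsemble:
    """Toy sharded ensemble: one stand-in model per disjoint data shard."""

    def __init__(self, data, num_shards):
        # data: list of (point_id, label); points are assigned round-robin.
        self.shards = [[] for _ in range(num_shards)]
        for i, item in enumerate(data):
            self.shards[i % num_shards].append(item)
        self.retrain_count = 0
        self.models = [self._train(shard) for shard in self.shards]

    def _train(self, shard):
        # Stand-in for real training: the shard's majority label.
        self.retrain_count += 1
        labels = Counter(label for _, label in shard)
        return labels.most_common(1)[0][0] if labels else None

    def unlearn(self, point_id):
        # Only the shard that contained the point is retrained.
        for idx, shard in enumerate(self.shards):
            kept = [item for item in shard if item[0] != point_id]
            if len(kept) != len(shard):
                self.shards[idx] = kept
                self.models[idx] = self._train(kept)

    def predict(self):
        # Aggregate the per-shard models by majority vote.
        votes = Counter(m for m in self.models if m is not None)
        return votes.most_common(1)[0][0]

data = [(i, "b" if i in (3, 7) else "a") for i in range(8)]
ensemble = ShardedEnsemble(data, num_shards=4)
ensemble.unlearn(3)
print(ensemble.retrain_count, ensemble.predict())
```

Here unlearning one point costs a single shard retraining instead of retraining on all eight points, which is the kind of saving the reported speed-ups measure at scale.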

CR · Oct 11, 2019
SoK: Hardware Security Support for Trustworthy Execution

Lianying Zhao, He Shuang, Shengjie Xu et al.

In recent years, there have emerged many new hardware mechanisms for improving the security of our computer systems. Hardware offers many advantages over pure software approaches: immutability of mechanisms to software attacks, better execution and power efficiency and a smaller interface allowing it to better maintain secrets. This has given birth to a plethora of hardware mechanisms providing trusted execution environments (TEEs), support for integrity checking and memory safety and widespread uses of hardware roots of trust. In this paper, we systematize these approaches through the lens of abstraction. Abstraction is key to computing systems, and the interface between hardware and software contains many abstractions. We find that these abstractions, when poorly designed, can both obscure information that is needed for security enforcement, as well as reveal information that needs to be kept secret, leading to vulnerabilities. We summarize such vulnerabilities and discuss several research trends of this area.

CR · Nov 29, 2017
Sound Patch Generation for Vulnerabilities

Zhen Huang, David Lie

Security vulnerabilities are among the most critical software defects in existence. As such, they require patches that are correct and quickly deployed. This motivates an automatic patch generation method that emphasizes both soundness and wide applicability. To address this challenge, we propose Senx, which uses three novel patch generation techniques to create patches for out-of-bounds read/write vulnerabilities. Senx uses symbolic execution to extract expressions from the source code of a target application to synthesize patches. To reduce the runtime overhead of patches, it uses loop cloning and access range analysis to analyze loops involved in these vulnerabilities and elevate patches outside of loops. For vulnerabilities that span multiple functions, Senx uses expression translation to translate expressions and place them in a function scope where all values are available to create the patch. This enables Senx to patch vulnerabilities with complex loops and interprocedural dependencies that previous semantics-based patch generation systems cannot handle. We have implemented a prototype using this approach. Our evaluation shows that the patches generated by Senx successfully fix 76% of 42 real-world vulnerabilities from 11 applications including various tools or libraries for manipulating graphics/media files, a programming language interpreter, a relational database engine, a collection of programming tools for creating and managing binary programs, and a collection of basic file, shell, and text manipulation tools. All patches that Senx produces are sound, and Senx correctly aborts patch generations in cases where its analysis will fall short.

SE · Nov 6, 2017
SAIC: Identifying Configuration Files for System Configuration Management

Zhen Huang, David Lie

Systems can become misconfigured for a variety of reasons such as operator errors or buggy patches. When a misconfiguration is discovered, usually the first order of business is to restore availability, often by undoing the misconfiguration. To simplify this task, we propose the Statistical Analysis for Identifying Configuration Files (SAIC), which analyzes how the contents of a file change over time to automatically determine which files contain configuration state. In this way, SAIC reduces the number of files a user must manually examine during recovery and allows versioning file systems to make more efficient use of their versioning storage. The two key insights that enable SAIC to identify configuration files are that configuration state must persist across executions of an application and that configuration state changes at a slower rate than other types of application state. SAIC applies these insights through a set of filters, which eliminate non-persistent files from consideration, and a novel similarity metric, which measures how similar a file's versions are to each other. Together, these two mechanisms enable SAIC to identify all 72 configuration files out of 2363 versioned files from 6 common applications in two user traces, while mistaking only 33 non-configuration files as configuration files, which allows a versioning file system to eliminate roughly 66% of non-configuration file versions from its logs, thus reducing the number of file versions that a user must try to recover from a misconfiguration.
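The two mechanisms, a persistence filter and a version-similarity score, can be sketched roughly as below. This is not SAIC's metric: the standard-library `difflib` ratio is swapped in as a stand-in for the paper's novel similarity measure, and the function names and the 0.8 threshold are assumptions.

```python
from difflib import SequenceMatcher

def mean_version_similarity(versions):
    """Average similarity between consecutive versions of one file's contents."""
    if len(versions) < 2:
        raise ValueError("need at least two versions")
    ratios = [SequenceMatcher(None, a, b).ratio()
              for a, b in zip(versions, versions[1:])]
    return sum(ratios) / len(ratios)

def looks_like_configuration(versions, persisted, threshold=0.8):
    """Filter out non-persistent files, then require slowly changing contents."""
    return persisted and mean_version_similarity(versions) >= threshold

# Hypothetical file histories: one setting edited vs. contents fully replaced.
config_versions = ["port=80\nhost=a\n", "port=8080\nhost=a\n"]
cache_versions = ["abc", "xyz"]
print(looks_like_configuration(config_versions, persisted=True))
print(looks_like_configuration(cache_versions, persisted=True))
```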

CR · Nov 2, 2017
Talos: Neutralizing Vulnerabilities with Security Workarounds for Rapid Response

Zhen Huang, Mariana D'Angelo, Dhaval Miyani et al.

Considerable delays often exist between the discovery of a vulnerability and the issue of a patch. One way to mitigate this window of vulnerability is to use a configuration workaround, which prevents the vulnerable code from being executed at the cost of some lost functionality -- but only if one is available. Since program configurations are not specifically designed to mitigate software vulnerabilities, we find that they only cover 25.2% of vulnerabilities. To minimize patch delay vulnerabilities and address the limitations of configuration workarounds, we propose Security Workarounds for Rapid Response (SWRRs), which are designed to neutralize security vulnerabilities in a timely, secure, and unobtrusive manner. Similar to configuration workarounds, SWRRs neutralize vulnerabilities by preventing vulnerable code from being executed at the cost of some lost functionality. However, the key difference is that SWRRs use existing error-handling code within programs, which enables them to be mechanically inserted with minimal knowledge of the program and minimal developer effort. This allows SWRRs to achieve high coverage while still being fast and easy to deploy. We have designed and implemented Talos, a system that mechanically instruments SWRRs into a given program, and evaluate it on five popular Linux server programs. We run exploits against 11 real-world software vulnerabilities and show that SWRRs neutralize the vulnerabilities in all cases. Quantitative measurements on 320 SWRRs indicate that SWRRs instrumented by Talos can neutralize 75.1% of all potential vulnerabilities and incur a loss of functionality similar to configuration workarounds in 71.3% of those cases. Our overall conclusion is that automatically generated SWRRs can safely mitigate 2.1x more vulnerabilities, while only incurring a loss of functionality comparable to that of traditional configuration workarounds.

SE · Nov 2, 2017
Ocasta: Clustering Configuration Settings For Error Recovery

Zhen Huang, David Lie

Effective machine-aided diagnosis and repair of configuration errors continues to elude computer systems designers. Most of the literature targets errors that can be attributed to a single erroneous configuration setting. However, a recent study found that a significant number of configuration errors require fixing more than one setting together. To address this limitation, Ocasta statistically clusters dependent configuration settings based on the application's accesses to its configuration settings and utilizes the extracted clustering of configuration settings to fix configuration errors involving more than one configuration setting. Ocasta treats applications as black boxes and only relies on the ability to observe application accesses to their configuration settings. We collected traces of real application usage from 24 Linux and 5 Windows desktop computers and found that Ocasta is able to correctly identify clusters with 88.6% accuracy. To demonstrate the effectiveness of Ocasta, we evaluated it on 16 real-world configuration errors of 11 Linux and Windows applications. Ocasta is able to successfully repair all evaluated configuration errors in 11 minutes on average and only requires the user to examine an average of 3 screenshots of the output of the application to confirm that the error is repaired. A user study we conducted shows that Ocasta is easy to use by both expert and non-expert users and is more efficient than manual configuration error troubleshooting.

CR · Oct 11, 2017
Unity 2.0: Secure and Durable Personal Cloud Storage

Beom Heyn Kim, Wei Huang, Afshar Ganjali et al.

While personal cloud storage services such as Dropbox, OneDrive, Google Drive and iCloud have become very popular in recent years, these services offer few security guarantees to users. These cloud services are aimed at end users, whose applications often assume local file system storage, and thus require strongly consistent data. In addition, users usually access these services using personal computers and portable devices such as phones and tablets, which are upload bandwidth constrained and in many cases battery powered. Unity is a system that provides confidentiality, integrity, durability and strong consistency while minimizing the upload bandwidth of its clients. We find that Unity consumes minimal upload bandwidth for compute-heavy workloads compared to NFS and Dropbox, while using a similar amount of upload bandwidth for write-heavy workloads relative to NBD. Although read-heavy workloads tend to consume more upload bandwidth with Unity, this is no more than an eighth of the size of the blocks replicated, and there is much room for optimization. Moreover, Unity provides the flexibility to maintain multiple DEs, giving scalability for multiple devices to concurrently access the data with minimal lease switch cost.

CR · Oct 2, 2017
Prochlo: Strong Privacy for Analytics in the Crowd

Andrea Bittau, Úlfar Erlingsson, Petros Maniatis et al.

The large-scale monitoring of computer users' software activities has become commonplace, e.g., for application telemetry, error reporting, or demographic profiling. This paper describes a principled systems architecture -- Encode, Shuffle, Analyze (ESA) -- for performing such monitoring with high utility while also protecting user privacy. The ESA design, and its Prochlo implementation, are informed by our practical experiences with an existing, large deployment of privacy-preserving software monitoring. (cont.; see the paper)
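A very rough sketch of the three ESA stages, for orientation only. The truncated abstract does not specify what each stage does, so the bodies below are assumptions: encoding drops the user identifier, shuffling randomizes batch order so reports cannot be linked to arrival order, and analysis only aggregates. Encryption, thresholding, and any hardware support are omitted.

```python
import random
from collections import Counter

def encode(report):
    """Client side: keep only the value needed for analysis (assumed behavior)."""
    return report["value"]

def shuffle(encoded_reports, seed=None):
    """Intermediary: return the batch in random order, with no sender metadata."""
    batch = list(encoded_reports)
    random.Random(seed).shuffle(batch)
    return batch

def analyze(batch):
    """Analyzer: compute aggregate counts over the anonymous batch."""
    return Counter(batch)

# Hypothetical telemetry reports.
reports = [
    {"user": "u1", "value": "crash"},
    {"user": "u2", "value": "ok"},
    {"user": "u3", "value": "crash"},
]
batch = shuffle([encode(r) for r in reports], seed=1)
histogram = analyze(batch)
print(histogram)
```

The aggregate is unaffected by the shuffle, which is why utility can be preserved while the analyzer never sees who sent which report.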

CR · Feb 24, 2017
Glimmers: Resolving the Privacy/Trust Quagmire

David Lie, Petros Maniatis

Many successful services rely on trustworthy contributions from users. To establish that trust, such services often require access to privacy-sensitive information from users, thus creating a conflict between privacy and trust. Although it is likely impractical to expect both absolute privacy and trustworthiness at the same time, we argue that the current state of things, where individual privacy is usually sacrificed at the altar of trustworthy services, can be improved with a pragmatic Glimmer of Trust, which allows services to validate user contributions in a trustworthy way without forfeiting user privacy. We describe how trustworthy hardware such as Intel's SGX can be used client-side -- in contrast to much recent work exploring SGX in cloud services -- to realize the Glimmer architecture, and demonstrate how this realization is able to resolve the tension between privacy and trust in a variety of cases.