Ken R. Duffy

IT
7papers
110citations
Novelty51%
AI Score40

7 Papers

LGMar 31, 2023
PEOPL: Characterizing Privately Encoded Open Datasets with Public Labels

Homa Esfahanizadeh, Adam Yala, Rafael G. L. D'Oliveira et al. · berkeley, mit

Allowing organizations to share their data for training of machine learning (ML) models without unintended information leakage is an open problem in practice. A promising technique for this still-open problem is to train models on the encoded data. Our approach, called Privately Encoded Open Datasets with Public Labels (PEOPL), uses a certain class of randomly constructed transforms to encode sensitive data. Organizations publish their randomly encoded data and associated raw labels for ML training, where training is done without knowledge of the encoding realization. We investigate several important aspects of this problem: We introduce information-theoretic scores for privacy and utility, which quantify the average performance of an unfaithful user (e.g., adversary) and a faithful user (e.g., model developer) that have access to the published encoded data. We then theoretically characterize primitives in building families of encoding schemes that motivate the use of random deep neural networks. Empirically, we compare the performance of our randomized encoding scheme and a linear scheme to a suite of computational attacks, and we also show that our scheme achieves competitive prediction accuracy to raw-sample baselines. Moreover, we demonstrate that multiple institutions, using independent random encoders, can collaborate to train improved ML models.

61.5QUANT-PHMar 18
Efficient Soft-Output Guessing for Enhanced Quantum Tanner Code Decoding

Lukas Rapp, Muriel Médard, Eugene Tang et al.

We introduce a generalized low-density parity-check decoding framework for quantum Tanner codes utilizing soft-output guessing random additive noise decoding (SOGRAND). By soft-output decoding entire component codes, we mitigate trapping sets and cycles, resulting in improved convergence. SOGRAND, combined with ordered statistic decoding (OSD) post-processing, outperforms the standard belief propagation plus OSD baseline by up to three orders of magnitude in logical error rate, providing a way forward for scalable decoding of the emerging class of Tanner-code-based quantum codes.

ITFeb 7, 2022
Partial Encryption after Encoding for Security and Reliability in Data Systems

Alejandro Cohen, Rafael G. L. D'Oliveira, Ken R. Duffy et al.

We consider the problem of secure and reliable communication over a noisy multipath network. Previous work considering a noiseless version of our problem proposed a hybrid universal network coding cryptosystem (HUNCC). By combining an information-theoretically secure encoder together with partial encryption, HUNCC is able to obtain security guarantees, even in the presence of an all-observing eavesdropper. In this paper, we propose a version of HUNCC for noisy channels (N-HUNCC). This modification requires four main novelties. First, we present a network coding construction which is jointly, individually secure and error-correcting. Second, we introduce a new security definition which is a computational analogue of individual security, which we call individual indistinguishability under chosen ciphertext attack (individual IND-CCA1), and show that NHUNCC satisfies it. Third, we present a noise based decoder for N-HUNCC, which permits the decoding of the encoded-thenencrypted data. Finally, we discuss how to select parameters for N-HUNCC and its error-correcting capabilities.

LGJan 28, 2022
Syfer: Neural Obfuscation for Private Data Release

Adam Yala, Victor Quach, Homa Esfahanizadeh et al.

Balancing privacy and predictive utility remains a central challenge for machine learning in healthcare. In this paper, we develop Syfer, a neural obfuscation method to protect against re-identification attacks. Syfer composes trained layers with random neural networks to encode the original data (e.g. X-rays) while maintaining the ability to predict diagnoses from the encoded data. The randomness in the encoder acts as the private key for the data owner. We quantify privacy as the number of attacker guesses required to re-identify a single image (guesswork). We propose a contrastive learning algorithm to estimate guesswork. We show empirically that differentially private methods, such as DP-Image, obtain privacy at a significant loss of utility. In contrast, Syfer achieves strong privacy while preserving utility. For example, X-ray classifiers built with DP-image, Syfer, and original data achieve average AUCs of 0.53, 0.78, and 0.86, respectively.

CRJun 4, 2021
NeuraCrypt: Hiding Private Health Data via Random Neural Networks for Public Training

Adam Yala, Homa Esfahanizadeh, Rafael G. L. D' Oliveira et al.

Balancing the needs of data privacy and predictive utility is a central challenge for machine learning in healthcare. In particular, privacy concerns have led to a dearth of public datasets, complicated the construction of multi-hospital cohorts and limited the utilization of external machine learning resources. To remedy this, new methods are required to enable data owners, such as hospitals, to share their datasets publicly, while preserving both patient privacy and modeling utility. We propose NeuraCrypt, a private encoding scheme based on random deep neural networks. NeuraCrypt encodes raw patient data using a randomly constructed neural network known only to the data-owner, and publishes both the encoded data and associated labels publicly. From a theoretical perspective, we demonstrate that sampling from a sufficiently rich family of encoding functions offers a well-defined and meaningful notion of privacy against a computationally unbounded adversary with full knowledge of the underlying data-distribution. We propose to approximate this family of encoding functions through random deep neural networks. Empirically, we demonstrate the robustness of our encoding to a suite of adversarial attacks and show that NeuraCrypt achieves competitive accuracy to non-private baselines on a variety of x-ray tasks. Moreover, we demonstrate that multiple hospitals, using independent private encoders, can collaborate to train improved x-ray models. Finally, we release a challenge dataset to encourage the development of new attacks on NeuraCrypt.

ITOct 2, 2017
Privacy with Estimation Guarantees

Hao Wang, Lisa Vo, Flavio P. Calmon et al.

We study the central problem in data privacy: how to share data with an analyst while providing both privacy and utility guarantees to the user that owns the data. In this setting, we present an estimation-theoretic analysis of the privacy-utility trade-off (PUT). Here, an analyst is allowed to reconstruct (in a mean-squared error sense) certain functions of the data (utility), while other private functions should not be reconstructed with distortion below a certain threshold (privacy). We demonstrate how chi-square information captures the fundamental PUT in this case and provide bounds for the best PUT. We propose a convex program to compute privacy-assuring mappings when the functions to be disclosed and hidden are known a priori and the data distribution is known. We derive lower bounds on the minimum mean-squared error of estimating a target function from the disclosed data and evaluate the robustness of our approach when an empirical distribution is used to compute the privacy-assuring mappings instead of the true data distribution. We illustrate the proposed approach through two numerical experiments.

ITJan 27, 2013
Brute force searching, the typical set and Guesswork

Mark M. Christiansen, Ken R. Duffy, Flavio du Pin Calmon et al.

Consider the situation where a word is chosen probabilistically from a finite list. If an attacker knows the list and can inquire about each word in turn, then selecting the word via the uniform distribution maximizes the attacker's difficulty, its Guesswork, in identifying the chosen word. It is tempting to use this property in cryptanalysis of computationally secure ciphers by assuming coded words are drawn from a source's typical set and so, for all intents and purposes, uniformly distributed within it. By applying recent results on Guesswork, for i.i.d. sources it is this equipartition ansatz that we investigate here. In particular, we demonstrate that the expected Guesswork for a source conditioned to create words in the typical set grows, with word length, at a lower exponential rate than that of the uniform approximation, suggesting use of the approximation is ill-advised.