Ken R. Duffy

h-index: 33

11 papers

4,813 citations

Novelty: 53%

AI Score: 42

Ranked #63,581 of 194,257 authors (top 33%) · #182 in IT (top 24%)

11 Papers

2.0 · LG · Mar 31, 2023

PEOPL: Characterizing Privately Encoded Open Datasets with Public Labels

Homa Esfahanizadeh, Adam Yala, Rafael G. L. D'Oliveira et al. · Berkeley, MIT

Allowing organizations to share their data for training of machine learning (ML) models without unintended information leakage is an open problem in practice. A promising technique for this still-open problem is to train models on the encoded data. Our approach, called Privately Encoded Open Datasets with Public Labels (PEOPL), uses a certain class of randomly constructed transforms to encode sensitive data. Organizations publish their randomly encoded data and associated raw labels for ML training, where training is done without knowledge of the encoding realization. We investigate several important aspects of this problem: We introduce information-theoretic scores for privacy and utility, which quantify the average performance of an unfaithful user (e.g., adversary) and a faithful user (e.g., model developer) that have access to the published encoded data. We then theoretically characterize primitives in building families of encoding schemes that motivate the use of random deep neural networks. Empirically, we compare the performance of our randomized encoding scheme and a linear scheme to a suite of computational attacks, and we also show that our scheme achieves competitive prediction accuracy to raw-sample baselines. Moreover, we demonstrate that multiple institutions, using independent random encoders, can collaborate to train improved ML models.

8.0 · QUANT-PH · Mar 18

Efficient Soft-Output Guessing for Enhanced Quantum Tanner Code Decoding

Lukas Rapp, Muriel Médard, Eugene Tang et al.

We introduce a generalized low-density parity-check decoding framework for quantum Tanner codes utilizing soft-output guessing random additive noise decoding (SOGRAND). By soft-output decoding entire component codes, we mitigate trapping sets and cycles, resulting in improved convergence. SOGRAND, combined with ordered statistic decoding (OSD) post-processing, outperforms the standard belief propagation plus OSD baseline by up to three orders of magnitude in logical error rate, providing a way forward for scalable decoding of the emerging class of Tanner-code-based quantum codes.
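
The guessing principle behind GRAND-style decoders can be illustrated with a minimal sketch. This is plain hard-decision GRAND over a binary symmetric channel, not the soft-output SOGRAND variant and not a quantum Tanner code: the (7,4) Hamming code, the function names, and the `max_weight` abandonment threshold are all assumptions chosen for illustration.

```python
from itertools import combinations

# Parity-check matrix of the (7,4) Hamming code (an illustrative choice,
# not a code from the paper): column j is the binary expansion of j + 1.
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]


def is_codeword(word):
    """Return True if every parity check is satisfied (zero syndrome)."""
    return all(sum(h * w for h, w in zip(row, word)) % 2 == 0 for row in H)


def grand_decode(received, max_weight=3):
    """Hard-decision GRAND for a binary symmetric channel.

    Noise patterns are tried from most to least likely, which for a
    flip probability below 1/2 means increasing Hamming weight.
    Returns (codeword, guesses), or (None, guesses) if the search is
    abandoned after max_weight (a hypothetical cut-off parameter).
    """
    n = len(received)
    guesses = 0
    for weight in range(max_weight + 1):
        for flips in combinations(range(n), weight):
            guesses += 1
            candidate = list(received)
            for i in flips:
                candidate[i] ^= 1  # remove the guessed noise pattern
            if is_codeword(candidate):
                return candidate, guesses
    return None, guesses
```

With the codeword `[1, 1, 1, 0, 0, 0, 0]` and a single flip at index 4, `grand_decode([1, 1, 1, 0, 1, 0, 0])` recovers the codeword on its sixth guess. The decoder only needs a codebook membership test, which is why the same idea applies to any component code.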

5.9 · NI · Jun 13

Sub Terahertz LEO Satellite Communication: Vision, Opportunities, and Challenges toward the First Prototype in Space

Sergi Aliaga, Vitaly Petrov, Andrew Benincasa et al.

The landscape of sub-terahertz (sub-THz, 100 GHz to 300 GHz) wireless technology has evolved drastically over the last two decades, from only a few niche use cases in sensing and ultra-short-range communications in the early 2000s toward operational multi-kilometer-range 100 Gbit/s+ wireless backhaul links demonstrated recently. Building on this momentum, this article explores the feasibility of extending sub-THz communications to 100-km-scale satellite links. We first assess the technological readiness of emerging sub-THz hardware and signal-processing techniques, highlighting their potential to support long-range operation in low-Earth-orbit (LEO) systems. We then outline the unique role that sub-THz links can play as a complementary solution to existing millimeter-wave and optical ("laser") satellite technologies, offering additional capacity, improved resilience, and new architectural flexibility. We further discuss open research and engineering challenges toward implementing such sub-THz satellite communication systems in practice. We finally outline the key state-of-the-art solutions and the roadmap of TeraLink, an ongoing international R&D project aiming to build and launch, through an approved NASA CSLI space mission, the first hardware prototype of sub-THz LEO satellite communications in space.

2.3 · IT · Feb 7, 2022

Partial Encryption after Encoding for Security and Reliability in Data Systems

Alejandro Cohen, Rafael G. L. D'Oliveira, Ken R. Duffy et al.

We consider the problem of secure and reliable communication over a noisy multipath network. Previous work considering a noiseless version of our problem proposed a hybrid universal network coding cryptosystem (HUNCC). By combining an information-theoretically secure encoder together with partial encryption, HUNCC is able to obtain security guarantees, even in the presence of an all-observing eavesdropper. In this paper, we propose a version of HUNCC for noisy channels (N-HUNCC). This modification requires four main novelties. First, we present a network coding construction which is jointly, individually secure and error-correcting. Second, we introduce a new security definition which is a computational analogue of individual security, which we call individual indistinguishability under chosen ciphertext attack (individual IND-CCA1), and show that N-HUNCC satisfies it. Third, we present a noise-based decoder for N-HUNCC, which permits the decoding of the encoded-then-encrypted data. Finally, we discuss how to select parameters for N-HUNCC and its error-correcting capabilities.

4.6 · LG · Jan 28, 2022

Syfer: Neural Obfuscation for Private Data Release

Adam Yala, Victor Quach, Homa Esfahanizadeh et al.

Balancing privacy and predictive utility remains a central challenge for machine learning in healthcare. In this paper, we develop Syfer, a neural obfuscation method to protect against re-identification attacks. Syfer composes trained layers with random neural networks to encode the original data (e.g. X-rays) while maintaining the ability to predict diagnoses from the encoded data. The randomness in the encoder acts as the private key for the data owner. We quantify privacy as the number of attacker guesses required to re-identify a single image (guesswork). We propose a contrastive learning algorithm to estimate guesswork. We show empirically that differentially private methods, such as DP-Image, obtain privacy at a significant loss of utility. In contrast, Syfer achieves strong privacy while preserving utility. For example, X-ray classifiers built with DP-Image, Syfer, and original data achieve average AUCs of 0.53, 0.78, and 0.86, respectively.

17.0 · CR · Jun 4, 2021 · Code

NeuraCrypt: Hiding Private Health Data via Random Neural Networks for Public Training

Adam Yala, Homa Esfahanizadeh, Rafael G. L. D'Oliveira et al.

Balancing the needs of data privacy and predictive utility is a central challenge for machine learning in healthcare. In particular, privacy concerns have led to a dearth of public datasets, complicated the construction of multi-hospital cohorts and limited the utilization of external machine learning resources. To remedy this, new methods are required to enable data owners, such as hospitals, to share their datasets publicly, while preserving both patient privacy and modeling utility. We propose NeuraCrypt, a private encoding scheme based on random deep neural networks. NeuraCrypt encodes raw patient data using a randomly constructed neural network known only to the data-owner, and publishes both the encoded data and associated labels publicly. From a theoretical perspective, we demonstrate that sampling from a sufficiently rich family of encoding functions offers a well-defined and meaningful notion of privacy against a computationally unbounded adversary with full knowledge of the underlying data-distribution. We propose to approximate this family of encoding functions through random deep neural networks. Empirically, we demonstrate the robustness of our encoding to a suite of adversarial attacks and show that NeuraCrypt achieves competitive accuracy to non-private baselines on a variety of x-ray tasks. Moreover, we demonstrate that multiple hospitals, using independent private encoders, can collaborate to train improved x-ray models. Finally, we release a challenge dataset to encourage the development of new attacks on NeuraCrypt.

2.3 · IT · Dec 25, 2017

Guesswork Subject to a Total Entropy Budget

Arman Rezaee, Ahmad Beirami, Ali Makhdoumi et al.

We consider an abstraction of computational security in password protected systems where a user draws a secret string of given length with i.i.d. characters from a finite alphabet, and an adversary would like to identify the secret string by querying, or guessing, the identity of the string. The concept of a "total entropy budget" on the chosen word by the user is natural, otherwise the chosen password would have arbitrary length and complexity. One intuitively expects that a password chosen from the uniform distribution is more secure. This is not the case, however, if we are considering only the average guesswork of the adversary when the user is subject to a total entropy budget. The optimality of the uniform distribution for the user's secret string holds when we have also a budget on the guessing adversary. We suppose that the user is subject to a "total entropy budget" for choosing the secret string, whereas the computational capability of the adversary is determined by his "total guesswork budget." We study the regime where the adversary's chances are exponentially small in guessing the secret string chosen subject to a total entropy budget. We introduce a certain notion of uniformity and show that a more uniform source will provide better protection against the adversary in terms of his chances of success in guessing the secret string. In contrast, the average number of queries that it takes the adversary to identify the secret string is smaller for the more uniform secret string subject to the same total entropy budget.

8.6 · IT · Oct 2, 2017

Privacy with Estimation Guarantees

Hao Wang, Lisa Vo, Flavio P. Calmon et al.

We study the central problem in data privacy: how to share data with an analyst while providing both privacy and utility guarantees to the user that owns the data. In this setting, we present an estimation-theoretic analysis of the privacy-utility trade-off (PUT). Here, an analyst is allowed to reconstruct (in a mean-squared error sense) certain functions of the data (utility), while other private functions should not be reconstructed with distortion below a certain threshold (privacy). We demonstrate how chi-square information captures the fundamental PUT in this case and provide bounds for the best PUT. We propose a convex program to compute privacy-assuring mappings when the functions to be disclosed and hidden are known a priori and the data distribution is known. We derive lower bounds on the minimum mean-squared error of estimating a target function from the disclosed data and evaluate the robustness of our approach when an empirical distribution is used to compute the privacy-assuring mappings instead of the true data distribution. We illustrate the proposed approach through two numerical experiments.

3.6 · ML · Jun 15, 2016

Network Maximal Correlation

Soheil Feizi, Ali Makhdoumi, Ken Duffy et al.

We introduce Network Maximal Correlation (NMC) as a multivariate measure of nonlinear association among random variables. NMC is defined via an optimization that infers transformations of variables by maximizing aggregate inner products between transformed variables. For finite discrete and jointly Gaussian random variables, we characterize a solution of the NMC optimization using basis expansion of functions over appropriate basis functions. For finite discrete variables, we propose an algorithm based on alternating conditional expectation to determine NMC. Moreover we propose a distributed algorithm to compute an approximation of NMC for large and dense graphs using graph partitioning. For finite discrete variables, we show that the probability of discrepancy greater than any given level between NMC and NMC computed using empirical distributions decays exponentially fast as the sample size grows. For jointly Gaussian variables, we show that under some conditions the NMC optimization is an instance of the Max-Cut problem. We then illustrate an application of NMC in inference of graphical model for bijective functions of jointly Gaussian variables. Finally, we show NMC's utility in a data application of learning nonlinear dependencies among genes in a cancer dataset.
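
The alternating conditional expectation idea can be sketched for the simplest case of two finite discrete variables, where the quantity computed is the classical bivariate maximal correlation rather than the full network measure. The function name, the fixed starting function, and the iteration count below are assumptions made for illustration, not details taken from the paper.

```python
import math


def maximal_correlation(joint, iterations=200):
    """Estimate the maximal correlation of two finite discrete variables
    by alternating conditional expectations.

    joint[x][y] is P(X = x, Y = y); all marginals are assumed positive.
    """
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]

    def standardize(values, probs):
        """Centre and scale to zero mean and unit variance.

        Returns None when the function is (numerically) constant.
        """
        mean = sum(v * p for v, p in zip(values, probs))
        centred = [v - mean for v in values]
        var = sum(c * c * p for c, p in zip(centred, probs))
        if var < 1e-18:
            return None
        return [c / math.sqrt(var) for c in centred]

    # Arbitrary non-constant starting function of Y (an assumption).
    g = standardize([float(j) for j in range(len(py))], py)
    if g is None:
        return 0.0
    rho = 0.0
    for _ in range(iterations):
        # f(x) = E[g(Y) | X = x], then standardize under P_X.
        f = standardize(
            [sum(p * gy for p, gy in zip(row, g)) / px_i
             for row, px_i in zip(joint, px)],
            px,
        )
        if f is None:
            return 0.0
        # g(y) = E[f(X) | Y = y], then standardize under P_Y.
        g = standardize(
            [sum(joint[x][y] * f[x] for x in range(len(px))) / py[y]
             for y in range(len(py))],
            py,
        )
        if g is None:
            return 0.0
        rho = sum(joint[x][y] * f[x] * g[y]
                  for x in range(len(px)) for y in range(len(py)))
    return rho
```

For a uniform binary input passed through a binary symmetric channel with crossover probability 0.1, i.e. `joint = [[0.45, 0.05], [0.05, 0.45]]`, the iteration returns 0.8 up to rounding, and for independent uniform bits it returns 0. The fixed starting function could in principle be orthogonal to the optimal transformation, so a more careful version would randomize or restart the initialization.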

2.3 · IT · Jan 27, 2013

Brute force searching, the typical set and Guesswork

Mark M. Christiansen, Ken R. Duffy, Flavio du Pin Calmon et al.

Consider the situation where a word is chosen probabilistically from a finite list. If an attacker knows the list and can inquire about each word in turn, then selecting the word via the uniform distribution maximizes the attacker's difficulty, its Guesswork, in identifying the chosen word. It is tempting to use this property in cryptanalysis of computationally secure ciphers by assuming coded words are drawn from a source's typical set and so, for all intents and purposes, uniformly distributed within it. By applying recent results on Guesswork, for i.i.d. sources it is this equipartition ansatz that we investigate here. In particular, we demonstrate that the expected Guesswork for a source conditioned to create words in the typical set grows, with word length, at a lower exponential rate than that of the uniform approximation, suggesting use of the approximation is ill-advised.
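
A small numerical sketch can make the comparison concrete. It computes the expected Guesswork of an i.i.d. binary source conditioned on a typical set and sets it against the uniform approximation over the same set. The weak-typicality definition with tolerance `eps`, the function names, and the parameter values are assumptions for illustration; the paper's result concerns exponential growth rates in word length, which a single finite-length computation does not establish.

```python
import math
from itertools import product


def expected_guesswork(probs):
    """Average number of guesses when words are queried from most to
    least likely; probs must sum to 1."""
    ordered = sorted(probs, reverse=True)
    return sum(rank * p for rank, p in enumerate(ordered, start=1))


def typical_set_comparison(p_one, n, eps):
    """For an i.i.d. Bernoulli(p_one) source with word length n, return
    (Guesswork conditioned on the typical set, uniform approximation).

    A word is called typical here if its per-symbol log-probability is
    within eps of the source entropy (an assumed definition).
    """
    entropy = -(p_one * math.log2(p_one)
                + (1 - p_one) * math.log2(1 - p_one))
    typical = []
    for word in product((0, 1), repeat=n):
        ones = sum(word)
        prob = p_one ** ones * (1 - p_one) ** (n - ones)
        if abs(-math.log2(prob) / n - entropy) <= eps:
            typical.append(prob)
    if not typical:
        raise ValueError("typical set is empty for these parameters")
    total = sum(typical)
    conditioned = expected_guesswork([p / total for p in typical])
    uniform = (len(typical) + 1) / 2  # expected Guesswork of a uniform word
    return conditioned, uniform
```

Calling `typical_set_comparison(0.3, 12, 0.15)` yields a conditioned Guesswork strictly below the uniform value, because words inside the typical set are not equally likely and the attacker guesses the likelier ones first.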

5.1 · IT · Oct 8, 2012

Lists that are smaller than their parts: A coding approach to tunable secrecy

Flavio du Pin Calmon, Muriel Médard, Linda M. Zeger et al.

We present a new information-theoretic definition and associated results, based on list decoding in a source coding setting. We begin by presenting list-source codes, which naturally map a key length (entropy) to list size. We then show that such codes can be analyzed in the context of a novel information-theoretic metric, ε-symbol secrecy, that encompasses both the one-time pad and traditional rate-based asymptotic metrics, but, like most cryptographic constructs, can be applied in non-asymptotic settings. We derive fundamental bounds for ε-symbol secrecy and demonstrate how these bounds can be achieved with MDS codes when the source is uniformly distributed. We discuss applications and implementation issues of our codes.