70.8DBApr 16
Dynamic read & write optimization with TurtleKVTony Astolfi, Vidya Silai, Darby Huye et al.
High read and write performance is important for generic key-value stores, which are foundational to modern applications and databases. Yet, achieving high performance for mixed and dynamic workloads is challenging due to fundamental trade-offs between memory use and I/O for retrieval and updates. Past work emphasizes the trade-off between read- and write-optimization as expressed through primary data structure, in combination with read-memory trade-off mechanisms like caching and filtering. This raises re-tuning costs as optimal trade-off targets change, due to restructuring of stored data. We show that write-memory trade-off mechanisms are under-developed in current designs, and propose a new approach to dynamic key-value store optimization using a novel read-/write-balanced on-disk structure, the TurtleTree, and flexible read-memory & write-memory tuning knobs. We describe how the design of TurtleKV, our prototype, avoids in-memory bottlenecks to achieve high performance across a wide range of tuning parameters. When evaluated using YCSB, TurtleKV matches state-of-the-art SplinterDB for inserts, and is 5x/12x faster than RockDB/WiredTiger. In mixed workloads, TurtleKV is 16-25% faster than SplinterDB, >4x RocksDB, and 3-6x WiredTiger. TurtleKV is 2-9x faster than the others for point-query workloads, and has the best scan performance of the write-optimized systems tested.
DBFeb 28, 2022
VaultDB: A Real-World Pilot of Secure Multi-Party Computation within a Clinical Research NetworkJennie Rogers, Elizabeth Adetoro, Johes Bater et al.
Electronic health records represent a rich and growing source of clinical data for research. Privacy, regulatory, and institutional concerns limit the speed and ease of sharing this data. VaultDB is a framework for securely computing SQL queries over private data from two or more sources. It evaluates queries using secure multiparty computation: cryptographic protocols that evaluate a function such that the only information revealed from running it is the query answer. We describe the development of a HIPAA-compliant version of VaultDB on the Chicago Area Patient Centered Outcomes Research Network (CAPriCORN). This multi-institutional clinical research network spans the electronic health records of nearly 13M patients over hundreds of clinics and hospitals in the Chicago metropolitan area. Our results from deploying at three health systems within this network show its efficiency and scalability for distributed clinical research analyses without moving patient records from their site of origin.
CRJan 16, 2022
Visualizing Privacy-Utility Trade-Offs in Differentially Private Data ReleasesPriyanka Nanayakkara, Johes Bater, Xi He et al.
Organizations often collect private data and release aggregate statistics for the public's benefit. If no steps toward preserving privacy are taken, adversaries may use released statistics to deduce unauthorized information about the individuals described in the private dataset. Differentially private algorithms address this challenge by slightly perturbing underlying statistics with noise, thereby mathematically limiting the amount of information that may be deduced from each data release. Properly calibrating these algorithms -- and in turn the disclosure risk for people described in the dataset -- requires a data curator to choose a value for a privacy budget parameter, $ε$. However, there is little formal guidance for choosing $ε$, a task that requires reasoning about the probabilistic privacy-utility trade-off. Furthermore, choosing $ε$ in the context of statistical inference requires reasoning about accuracy trade-offs in the presence of both measurement error and differential privacy (DP) noise. We present Visualizing Privacy (ViP), an interactive interface that visualizes relationships between $ε$, accuracy, and disclosure risk to support setting and splitting $ε$ among queries. As a user adjusts $ε$, ViP dynamically updates visualizations depicting expected accuracy and risk. ViP also has an inference setting, allowing a user to reason about the impact of DP noise on statistical inferences. Finally, we present results of a study where 16 research practitioners with little to no DP background completed a set of tasks related to setting $ε$ using both ViP and a control. We find that ViP helps participants more correctly answer questions related to judging the probability of where a DP-noised release is likely to fall and comparing between DP-noised and non-private confidence intervals.
DSJun 9, 2021
Prior-Aware Distribution Estimation for Differential PrivacyYuchao Tao, Johes Bater, Ashwin Machanavajjhala
Joint distribution estimation of a dataset under differential privacy is a fundamental problem for many privacy-focused applications, such as query answering, machine learning tasks and synthetic data generation. In this work, we examine the joint distribution estimation problem given two data points: 1) differentially private answers of a workload computed over private data and 2) a prior empirical distribution from a public dataset. Our goal is to find a new distribution such that estimating the workload using this distribution is as accurate as the differentially private answer, and the relative entropy, or KL divergence, of this distribution is minimized with respect to the prior distribution. We propose an approach based on iterative optimization for solving this problem. An application of our solution won second place in the NIST 2020 Differential Privacy Temporal Map Challenge, Sprint 2.
CRMar 29, 2021
DP-Sync: Hiding Update Patterns in Secure Outsourced Databases with Differential PrivacyChenghong Wang, Johes Bater, Kartik Nayak et al.
In this paper, we have introduced a new type of leakage associated with modern encrypted databases called update pattern leakage. We formalize the definition and security model of DP-Sync with DP update patterns. We also proposed the framework DP-Sync, which extends existing encrypted database schemes to DP-Sync with DP update patterns. DP-Sync guarantees that the entire data update history over the outsourced data structure is protected by differential privacy. This is achieved by imposing differentially-private strategies that dictate the data owner's synchronization of local~data.
DBJun 22, 2016
SMCQL: Secure Querying for Federated DatabasesJohes Bater, Gregory Elliott, Craig Eggen et al.
People and machines are collecting data at an unprecedented rate. Despite this newfound abundance of data, progress has been slow in sharing it for open science, business, and other data-intensive endeavors. Many such efforts are stymied by privacy concerns and regulatory compliance issues. For example, many hospitals are interested in pooling their medical records for research, but none may disclose arbitrary patient records to researchers or other healthcare providers. In this context we propose the Private Data Network (PDN), a federated database for querying over the collective data of mutually distrustful parties. In a PDN, each member database does not reveal its tuples to its peers nor to the query writer. Instead, the user submits a query to an honest broker that plans and coordinates its execution over multiple private databases using secure multiparty computation (SMC). Here, each database's query execution is oblivious, and its program counters and memory traces are agnostic to the inputs of others. We introduce a framework for executing PDN queries named SMCQL. This system translates SQL statements into SMC primitives to compute query results over the union of its source databases without revealing sensitive information about individual tuples to peer data providers or the honest broker. Only the honest broker and the querier receive the results of a PDN query. For fast, secure query evaluation, we explore a heuristics-driven optimizer that minimizes the PDN's use of secure computation and partitions its query evaluation into scalable slices.