Jennie Rogers

DB
5papers
152citations
Novelty51%
AI Score42

5 Papers

43.3DBMay 27
ScanTwin: Simulating Performance Regressions Without Access to Tenant Data

Donghyun Sohn, Jennie Rogers

In cloud data platforms, developers often encounter performance regressions that occur in specific tenant datasets. However, due to confidentiality constraints, they cannot access the original data, which makes it difficult to reproduce these regressions locally. Current methods for synthetic data usually focus on statistical properties, such as matching data distributions or improving query accuracy. However, they overlook the physical properties that control how the engine behaves during scans, including row-group pruning. We propose ScanTwin, a lightweight framework that extracts a per-row-group sketch from the Parquet footer, including boundary values and compressed sizes, and releases them under $\varepsilon$-differential privacy using a boundary parameterization. On TPC-H and SSB (6M rows), ScanTwin achieves 0% pruning error and less than 1% byte error at $\varepsilon{=}\infty$. Under $\varepsilon{=}5$, high-selectivity queries ($>$30%) incur below 8.5% pruning error on both datasets, and per-query scan timing on DuckDB closely tracks the original.

DBFeb 28, 2022
VaultDB: A Real-World Pilot of Secure Multi-Party Computation within a Clinical Research Network

Jennie Rogers, Elizabeth Adetoro, Johes Bater et al.

Electronic health records represent a rich and growing source of clinical data for research. Privacy, regulatory, and institutional concerns limit the speed and ease of sharing this data. VaultDB is a framework for securely computing SQL queries over private data from two or more sources. It evaluates queries using secure multiparty computation: cryptographic protocols that evaluate a function such that the only information revealed from running it is the query answer. We describe the development of a HIPAA-compliant version of VaultDB on the Chicago Area Patient Centered Outcomes Research Network (CAPriCORN). This multi-institutional clinical research network spans the electronic health records of nearly 13M patients over hundreds of clinics and hospitals in the Chicago metropolitan area. Our results from deploying at three health systems within this network show its efficiency and scalability for distributed clinical research analyses without moving patient records from their site of origin.

CRJan 16, 2022
Visualizing Privacy-Utility Trade-Offs in Differentially Private Data Releases

Priyanka Nanayakkara, Johes Bater, Xi He et al.

Organizations often collect private data and release aggregate statistics for the public's benefit. If no steps toward preserving privacy are taken, adversaries may use released statistics to deduce unauthorized information about the individuals described in the private dataset. Differentially private algorithms address this challenge by slightly perturbing underlying statistics with noise, thereby mathematically limiting the amount of information that may be deduced from each data release. Properly calibrating these algorithms -- and in turn the disclosure risk for people described in the dataset -- requires a data curator to choose a value for a privacy budget parameter, $ε$. However, there is little formal guidance for choosing $ε$, a task that requires reasoning about the probabilistic privacy-utility trade-off. Furthermore, choosing $ε$ in the context of statistical inference requires reasoning about accuracy trade-offs in the presence of both measurement error and differential privacy (DP) noise. We present Visualizing Privacy (ViP), an interactive interface that visualizes relationships between $ε$, accuracy, and disclosure risk to support setting and splitting $ε$ among queries. As a user adjusts $ε$, ViP dynamically updates visualizations depicting expected accuracy and risk. ViP also has an inference setting, allowing a user to reason about the impact of DP noise on statistical inferences. Finally, we present results of a study where 16 research practitioners with little to no DP background completed a set of tasks related to setting $ε$ using both ViP and a control. We find that ViP helps participants more correctly answer questions related to judging the probability of where a DP-noised release is likely to fall and comparing between DP-noised and non-private confidence intervals.

DBMar 31, 2019
KloakDB: A Platform for Analyzing Sensitive Data with $K$-anonymous Query Processing

Madhav Suresh, Zuohao She, William Wallace et al.

A private data federation enables data owners to pool their information for querying without disclosing their secret tuples to one another. Here, a client queries the union of the records of all data owners. The data owners work together to answer the query using privacy-preserving algorithms that prevent them from learning unauthorized information about the inputs of their peers. Only the client, and a federation coordinator, learn the query's output. KloakDB is a private data federation that uses trusted hardware to process SQL queries over the inputs of two or more parties. Currently private data federations compute their queries fully-obliviously, guaranteeing that no information is revealed about the sensitive inputs of a data owner to their peers by observing the query's instruction traces and memory access patterns. Oblivious querying almost always exacts multiple orders of magnitude slowdown in query runtimes compared to plaintext execution, making it impractical for many applications. KloakDB offers a semi-oblivious computing framework, $k$-anonymous query processing. We make the query's observable transcript $k$-anonymous because it is a popular standard for data release in many domains including medicine, educational research, and government data. KloakDB's queries run such that each data owner may deduce information about no fewer than $k$ individuals in the data of their peers. In addition, stakeholders set $k$, creating a novel trade-off between privacy and performance. Our results show that KloakDB enjoys speedups of up to $117$X using k-anonymous query processing over full-oblivious evaluation.

DBJun 22, 2016
SMCQL: Secure Querying for Federated Databases

Johes Bater, Gregory Elliott, Craig Eggen et al.

People and machines are collecting data at an unprecedented rate. Despite this newfound abundance of data, progress has been slow in sharing it for open science, business, and other data-intensive endeavors. Many such efforts are stymied by privacy concerns and regulatory compliance issues. For example, many hospitals are interested in pooling their medical records for research, but none may disclose arbitrary patient records to researchers or other healthcare providers. In this context we propose the Private Data Network (PDN), a federated database for querying over the collective data of mutually distrustful parties. In a PDN, each member database does not reveal its tuples to its peers nor to the query writer. Instead, the user submits a query to an honest broker that plans and coordinates its execution over multiple private databases using secure multiparty computation (SMC). Here, each database's query execution is oblivious, and its program counters and memory traces are agnostic to the inputs of others. We introduce a framework for executing PDN queries named SMCQL. This system translates SQL statements into SMC primitives to compute query results over the union of its source databases without revealing sensitive information about individual tuples to peer data providers or the honest broker. Only the honest broker and the querier receive the results of a PDN query. For fast, secure query evaluation, we explore a heuristics-driven optimizer that minimizes the PDN's use of secure computation and partitions its query evaluation into scalable slices.