DB CR Dec 31, 2020

Kamino: Constraint-Aware Differentially Private Data Synthesis

Chang Ge, Shubhankar Mohapatra, Xi He, Ihab F. Ilyas

arXiv:2012.15713v215.256 citations Has Code

Originality: Incremental advance

AI Analysis

This work is significant for organizations needing to publish private data while maintaining its structural integrity for downstream analytical tasks, offering an incremental improvement over existing differentially private synthesis methods.

This paper introduces Kamino, a data synthesis system that addresses the challenge of preserving data structure and correlations (integrity constraints) while ensuring differential privacy. Kamino generates synthetic data that maintains these structural properties, outperforming state-of-the-art methods in usefulness for classification model training and marginal query answering.

Organizations increasingly rely on data to support decisions. When data contains private and sensitive information, the data owner often wants to publish a synthetic database instance that is as useful as the true data while protecting the privacy of individual data records. Existing differentially private data synthesis methods aim to generate data that is useful for target applications, but they fail to preserve one of the most fundamental properties of structured data: the underlying correlations and dependencies among tuples and attributes (i.e., the structure of the data). This structure is often expressed as integrity and schema constraints, or with a probabilistic generative process. As a result, the synthesized data is not useful for any downstream task that requires this structure to be preserved. This work presents Kamino, a data synthesis system that ensures differential privacy and preserves the structure and correlations present in the original dataset. Kamino takes as input a database instance, along with its schema (including integrity constraints), and produces a synthetic database instance with differential privacy and structure preservation guarantees. We empirically show that, while preserving the structure of the data, Kamino achieves comparable and even better usefulness than state-of-the-art differentially private data synthesis methods in training classification models and answering marginal queries.
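To make "structure preservation" concrete, the sketch below counts violations of a single functional dependency (one common kind of integrity constraint) in a true and a synthetic table. It is only an illustration of the property discussed in the abstract, not Kamino's algorithm: the function name `count_fd_violations`, the `zip -> city` dependency, and the toy rows are all assumptions made up for this example.

```python
from collections import defaultdict


def count_fd_violations(rows, lhs, rhs):
    """Count pairs of rows that agree on every `lhs` attribute but differ on `rhs`.

    Hypothetical helper for illustration; not part of Kamino.
    """
    # For each left-hand-side value, tally how often each rhs value occurs.
    groups = defaultdict(lambda: defaultdict(int))
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        groups[key][row[rhs]] += 1

    violations = 0
    for rhs_counts in groups.values():
        total = sum(rhs_counts.values())
        all_pairs = total * (total - 1) // 2
        # Pairs that share the same rhs value satisfy the dependency.
        agreeing_pairs = sum(c * (c - 1) // 2 for c in rhs_counts.values())
        violations += all_pairs - agreeing_pairs
    return violations


# Toy data (assumed): the true table satisfies zip -> city.
true_rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "60601", "city": "Chicago"},
]
# A synthetic table that ignores the constraint can break it.
synthetic_rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Chicago"},
    {"zip": "60601", "city": "Chicago"},
]

print(count_fd_violations(true_rows, ["zip"], "city"))
print(count_fd_violations(synthetic_rows, ["zip"], "city"))
```

A synthesizer that samples attributes without regard to such dependencies can produce rows like the second table, which is the failure mode the abstract attributes to existing methods.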
