ML AP · Dec 18, 2013

Perturbed Gibbs Samplers for Synthetic Data Release

arXiv:1312.5370v1 · 4 citations

Originality: Incremental advance

AI Analysis

This work addresses privacy concerns in data sharing for researchers and organizations that handle sensitive categorical data, though it appears incremental because it extends existing multiple imputation strategies.

The authors tackle the problem of generating synthetic categorical data with quantifiable disclosure risk, proposing a Perturbed Gibbs Sampler algorithm that handles high-dimensional data. They demonstrate it on California Patient Discharge data, showing that the synthesized data have statistical properties comparable to those of the original data, and they evaluate disclosure risks through simulated intruder scenarios.

We propose a categorical data synthesizer with a quantifiable disclosure risk. Our algorithm, named Perturbed Gibbs Sampler, can handle high-dimensional categorical data that are often intractable to represent as contingency tables. The algorithm extends a multiple imputation strategy for fully synthetic data by utilizing feature hashing and non-parametric distribution approximations. California Patient Discharge data are used to demonstrate statistical properties of the proposed synthesizing methodology. Marginal and conditional distributions, as well as the coefficients of regression models built on the synthesized data, are compared to those obtained from the original data. Intruder scenarios are simulated to evaluate disclosure risks of the synthesized data from multiple angles. Limitations and extensions of the proposed algorithm are also discussed.
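To make the general idea concrete, here is a minimal sketch of one Gibbs sweep over a categorical record. It is not the paper's algorithm: the abstract does not specify the hashing scheme, the distribution approximation, or the perturbation mechanism. The sketch assumes the conditioning features are hashed jointly into a fixed number of buckets (in the spirit of feature hashing), that the conditional is approximated by empirical counts within a bucket, and that an additive constant `alpha` stands in for the perturbation. All function and parameter names are hypothetical.

```python
import random
import zlib


def bucket(record, skip, n_buckets):
    """Hash all features except `skip` into one of `n_buckets` buckets."""
    key = "|".join(f"{j}={v}" for j, v in enumerate(record) if j != skip)
    return zlib.crc32(key.encode("utf-8")) % n_buckets


def fit_conditionals(data, n_buckets):
    """For each feature j, count its values within each hashed context bucket."""
    n_features = len(data[0])
    counts = [{} for _ in range(n_features)]
    for rec in data:
        for j in range(n_features):
            per_bucket = counts[j].setdefault(bucket(rec, j, n_buckets), {})
            per_bucket[rec[j]] = per_bucket.get(rec[j], 0) + 1
    return counts


def gibbs_sweep(record, counts, domains, n_buckets, alpha, rng):
    """Resample each feature in turn from a smoothed, hashed conditional.

    `alpha` > 0 is an assumed additive perturbation: it gives every value in
    the domain a nonzero chance, even if unseen in the current bucket.
    """
    rec = list(record)
    for j, domain in enumerate(domains):
        seen = counts[j].get(bucket(rec, j, n_buckets), {})
        weights = [seen.get(v, 0) + alpha for v in domain]
        rec[j] = rng.choices(domain, weights=weights, k=1)[0]
    return tuple(rec)
```

A synthetic record could then be drawn by seeding from an original row and applying `gibbs_sweep` one or more times. In this sketch, a larger `alpha` moves the output further from the original data, and fewer buckets coarsen the conditionals while bounding memory.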
