CR Mar 6, 2025
A Consensus Privacy Metrics Framework for Synthetic Data
Lisa Pilgram, Fida K. Dankar, Jorg Drechsler et al.
Synthetic data generation is one approach for sharing individual-level data. However, to meet legislative requirements, it is necessary to demonstrate that the individuals' privacy is adequately protected. There is no consolidated standard for measuring privacy in synthetic data. Through an expert panel and consensus process, we developed a framework for evaluating privacy in synthetic data. Our findings indicate that current similarity metrics fail to measure identity disclosure, and their use is discouraged. For differentially private synthetic data, only a privacy budget close to zero was considered interpretable. There was consensus on the importance of membership and attribute disclosure, both of which involve inferring personal information about an individual without necessarily revealing their identity. The resultant framework provides precise recommendations for metrics that address these types of disclosure effectively. Our findings further present specific opportunities for future research that can help with widespread adoption of synthetic data.
CR Feb 10, 2022
A Note on the Misinterpretation of the US Census Re-identification Attack
Paul Francis
In 2018, the US Census Bureau designed a new data reconstruction and re-identification attack and tested it against their 2010 data release. The specific attack executed by the Bureau allows an attacker to infer the race and ethnicity of respondents with average 75% precision for 85% of the respondents, assuming that the attacker knows the correct age, sex, and address of the respondents. They interpreted the attack as exceeding the Bureau's privacy standards, and so introduced stronger privacy protections for the 2020 Census in the form of the TopDown Algorithm (TDA). This paper demonstrates that race and ethnicity can be inferred from the TDA-protected census data with substantially better precision and recall, using less prior knowledge: only the respondents' address. Race and ethnicity can be inferred with average 75% precision for 98% of the respondents, and can be inferred with 100% precision for 11% of the respondents. The inference is done by simply assuming that the race/ethnicity of the respondent is that of the majority race/ethnicity for the respondent's census block. The conclusion to draw from this simple demonstration is NOT that the Bureau's data releases lack adequate privacy protections. Indeed, it is the purpose of the data releases to allow this kind of inference. The problem, rather, is that the Bureau's criteria for measuring privacy are flawed and overly pessimistic.
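The majority-based inference described in this abstract can be illustrated with a minimal sketch. This is not the paper's code: the function name `majority_inference`, the `min_share` parameter, and the toy records are all hypothetical, and real census blocks and categories are far more detailed. The sketch guesses the majority group for every respondent in a block and reports the precision of those guesses together with the fraction of respondents covered.

```python
from collections import Counter, defaultdict


def majority_inference(records, min_share=0.0):
    """Guess each respondent's group as the majority group of their block.

    records   : list of (block_id, group) pairs (hypothetical toy format).
    min_share : only make guesses in blocks where the majority group holds
                at least this share of respondents (an illustrative knob,
                not something taken from the paper).
    Returns (precision, coverage): the fraction of guesses that are right,
    and the fraction of all respondents for whom a guess is made.
    """
    by_block = defaultdict(Counter)
    for block_id, group in records:
        by_block[block_id][group] += 1

    covered = correct = 0
    for counts in by_block.values():
        _, top = counts.most_common(1)[0]  # size of the majority group
        size = sum(counts.values())
        if top / size >= min_share:
            covered += size   # a guess is made for everyone in the block
            correct += top    # the guess is right for majority members only

    precision = correct / covered if covered else 0.0
    coverage = covered / len(records)
    return precision, coverage


# Hypothetical toy data: three blocks, three groups.
toy_records = [
    ("b1", "A"), ("b1", "A"), ("b1", "B"),
    ("b2", "B"), ("b2", "B"),
    ("b3", "A"), ("b3", "C"), ("b3", "C"), ("b3", "C"),
]

all_precision, all_coverage = majority_inference(toy_records)
pure_precision, pure_coverage = majority_inference(toy_records, min_share=1.0)
```

Raising `min_share` trades coverage for precision, which is one way to read statements of the form "X% precision for Y% of the respondents"; how the paper itself selects respondents may differ.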
CR Jan 12, 2022
Diffix Elm: Simple Diffix
Paul Francis, Sebastian Probst-Eide, David Wagner et al.
Historically, strong data anonymization has required substantial domain expertise and custom design for the given data set and use case. Diffix is an anonymization framework designed to make strong data anonymization available to non-experts. This paper describes Diffix Elm, a version of Diffix that is very easy to use at the expense of query features. We describe Diffix Elm, and show that it provides strong anonymity based on the General Data Protection Regulation (GDPR) criteria. This document is the third version of Diffix Elm. The second version added ceiling, round, and bucket_width functions (in addition to floor). This document adds the ability to protect multiple different kinds of protected entities (a feature not found in earlier versions of Diffix). It also adds counting distinct values for any column (rather than only the AID column).
CR Jun 6, 2018
Diffix-Birch: Extending Diffix-Aspen
Paul Francis, Sebastian Probst-Eide, Pawel Obrok et al.
A longstanding open problem is how to get high-quality statistics through direct queries to databases containing information about individuals without revealing information specific to those individuals. Diffix is a framework for anonymous database queries that adds noise based on the filter conditions in the query. A previous paper described the first version, called Diffix-Aspen. This version, Diffix-Birch, extends that description to include a wide variety of common features found in SQL. It describes attacks associated with various features, and the anonymization steps used to defend against those attacks. This paper describes Diffix-Birch, which was used for the bounty program sponsored by Aircloak starting December 2017.
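The idea of noise that depends on a query's filter conditions can be illustrated with a small sketch. This is not the Diffix-Birch algorithm, whose actual noise layers and seeding are defined in the paper. The function `noisy_count`, the secret value, and the condition strings are all hypothetical. The sketch only shows one property: when noise is derived deterministically from the conditions, repeating the same query returns the same noisy answer, so the noise cannot be averaged away by re-asking.

```python
import hashlib
import random


def noisy_count(true_count, conditions, secret=b"hypothetical-secret", sd=1.0):
    """Return a count perturbed by one noise sample per filter condition.

    Each sample is drawn from a generator seeded by a hash of a secret and
    the condition text, so the same condition always contributes the same
    noise value. All names and parameters here are illustrative assumptions.
    """
    total = float(true_count)
    for cond in sorted(conditions):  # sorting makes the result order-independent
        digest = hashlib.sha256(secret + cond.encode("utf-8")).digest()
        rng = random.Random(int.from_bytes(digest[:8], "big"))
        total += rng.gauss(0.0, sd)
    return round(total)


# Hypothetical filter conditions from a query's WHERE clause.
first = noisy_count(100, ["age = 30", "city = 'Berlin'"])
repeat = noisy_count(100, ["city = 'Berlin'", "age = 30"])
```

Here `first` and `repeat` are equal because both calls use the same set of conditions, while a query with different conditions would generally receive different noise.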