CROct 21, 2020
Multi-Dimensional Randomized ResponseJosep Domingo-Ferrer, Jordi Soria-Comas
In our data world, a host of not necessarily trusted controllers gather data on individual subjects. To preserve her privacy and, more generally, her informational self-determination, the individual has to be empowered by giving her agency on her own data. Maximum agency is afforded by local anonymization, that allows each individual to anonymize her own data before handing them to the data controller. Randomized response (RR) is a local anonymization approach able to yield multi-dimensional full sets of anonymized microdata that are valid for exploratory analysis and machine learning. This is so because an unbiased estimate of the distribution of the true data of individuals can be obtained from their pooled randomized data. Furthermore, RR offers rigorous privacy guarantees. The main weakness of RR is the curse of dimensionality when applied to several attributes: as the number of attributes grows, the accuracy of the estimated true data distribution quickly degrades. We propose several complementary approaches to mitigate the dimensionality problem. First, we present two basic protocols, separate RR on each attribute and joint RR for all attributes, and discuss their limitations. Then we introduce an algorithm to form clusters of attributes so that attributes in different clusters can be viewed as independent and joint RR can be performed within each cluster. After that, we introduce an adjustment algorithm for the randomized data set that repairs some of the accuracy loss due to assuming independence between attributes when using RR separately on each attribute or due to assuming independence between clusters in cluster-wise RR. We also present empirical work to illustrate the proposed methods.
CRMar 6, 2018
Connecting Randomized Response, Post-Randomization, Differential Privacy and t-Closeness via Deniability and PermutationJosep Domingo-Ferrer, Jordi Soria-Comas
We explore some novel connections between the main privacy models in use and we recall a few known ones. We show these models to be more related than commonly understood, around two main principles: deniability and permutation. In particular, randomized response turns out to be very modern in spite of it having been introduced over 50 years ago: it is a local anonymization method and it allows understanding the protection offered by $ε$-differential privacy when $ε$ is increased to improve utility. A similar understanding on the effect of large $ε$ in terms of deniability is obtained from the connection between $ε$-differential privacy and t-closeness. Finally, the post-randomization method (PRAM) is shown to be viewable as permutation and to be connected with randomized response and differential privacy. Since the latter is also connected with t-closeness, it follows that the permutation principle can explain the guarantees offered by all those models. Thus, calibrating permutation is very relevant in anonymization, and we conclude by sketching two ways of doing it.
CRDec 7, 2016
Individual Differential Privacy: A Utility-Preserving Formulation of Differential Privacy GuaranteesJordi Soria-Comas, Josep Domingo-Ferrer, David Sánchez et al.
Differential privacy is a popular privacy model within the research community because of the strong privacy guarantee it offers, namely that the presence or absence of any individual in a data set does not significantly influence the results of analyses on the data set. However, enforcing this strict guarantee in practice significantly distorts data and/or limits data uses, thus diminishing the analytical utility of the differentially private results. In an attempt to address this shortcoming, several relaxations of differential privacy have been proposed that trade off privacy guarantees for improved data utility. In this work, we argue that the standard formalization of differential privacy is stricter than required by the intuitive privacy guarantee it seeks. In particular, the standard formalization requires indistinguishability of results between any pair of neighbor data sets, while indistinguishability between the actual data set and its neighbor data sets should be enough. This limits the data controller's ability to adjust the level of protection to the actual data, hence resulting in significant accuracy loss. In this respect, we propose individual differential privacy, an alternative differential privacy notion that offers em the same privacy guarantees as standard differential privacy to individuals (even though not to groups of individuals). This new notion allows the data controller to adjust the distortion to the actual data set, which results in less distortion and more analytical accuracy. We propose several mechanisms to attain individual differential privacy and we compare the new notion against standard differential privacy in terms of the accuracy of the analytical results.
CRDec 9, 2015
t-Closeness through Microaggregation: Strict Privacy with Enhanced Utility PreservationJordi Soria-Comas, Josep Domingo-Ferrer, David Sánchez et al.
Microaggregation is a technique for disclosure limitation aimed at protecting the privacy of data subjects in microdata releases. It has been used as an alternative to generalization and suppression to generate $k$-anonymous data sets, where the identity of each subject is hidden within a group of $k$ subjects. Unlike generalization, microaggregation perturbs the data and this additional masking freedom allows improving data utility in several ways, such as increasing data granularity, reducing the impact of outliers and avoiding discretization of numerical data. $k$-Anonymity, on the other side, does not protect against attribute disclosure, which occurs if the variability of the confidential values in a group of $k$ subjects is too small. To address this issue, several refinements of $k$-anonymity have been proposed, among which $t$-closeness stands out as providing one of the strictest privacy guarantees. Existing algorithms to generate $t$-close data sets are based on generalization and suppression (they are extensions of $k$-anonymization algorithms based on the same principles). This paper proposes and shows how to use microaggregation to generate $k$-anonymous $t$-close data sets. The advantages of microaggregation are analyzed, and then several microaggregation algorithms for $k$-anonymous $t$-closeness are presented and empirically evaluated.
CRDec 9, 2015
Utility-Preserving Differentially Private Data Releases Via Individual Ranking MicroaggregationDavid Sánchez, Josep Domingo-Ferrer, Sergio Martínez et al.
Being able to release and exploit open data gathered in information systems is crucial for researchers, enterprises and the overall society. Yet, these data must be anonymized before release to protect the privacy of the subjects to whom the records relate. Differential privacy is a privacy model for anonymization that offers more robust privacy guarantees than previous models, such as $k$-anonymity and its extensions. However, it is often disregarded that the utility of differentially private outputs is quite limited, either because of the amount of noise that needs to be added to obtain them or because utility is only preserved for a restricted type and/or a limited number of queries. On the contrary, $k$-anonymity-like data releases make no assumptions on the uses of the protected data and, thus, do not restrict the number and type of doable analyses. Recently, some authors have proposed mechanisms to offer general-purpose differentially private data releases. This paper extends such works with a specific focus on the preservation of the utility of the protected data. Our proposal builds on microaggregation-based anonymization, which is more flexible and utility-preserving than alternative anonymization methods used in the literature, in order to reduce the amount of noise needed to satisfy differential privacy. In this way, we improve the utility of differentially private data releases. Moreover, the noise reduction we achieve does not depend on the size of the data set, but just on the number of attributes to be protected, which is a more desirable behavior for large data sets. The utility benefits brought by our proposal are empirically evaluated and compared with related works for several data sets and metrics.
CRJul 3, 2013
Improving data utility in differential privacy and k-anonymityJordi Soria-Comas
We focus on two mainstream privacy models: k-anonymity and differential privacy. Once a privacy model has been selected, the goal is to enforce it while preserving as much data utility as possible. The main objective of this thesis is to improve the data utility in k-anonymous and differentially private data releases. k-Anonymity has several drawbacks. On the disclosure limitation side, there is a lack of protection against attribute disclosure and against informed intruders. On the data utility side, dealing with a large number of quasi-identifier attributes is problematic. We propose a relaxation of k-anonymity that deals with these issues. Differential privacy limits disclosure risk through noise addition. The Laplace distribution is commonly used for the random noise. We show that the Laplace distribution is not optimal: the same disclosure limitation guarantee can be attained by adding less noise. Optimal univariate and multivariate noises are characterized and constructed. Common mechanisms to attain differential privacy do not take into account the users prior knowledge; they implicitly assume zero initial knowledge about the query response. We propose a mechanism that focuses on limiting the knowledge gain over the prior knowledge. Microaggregation-based k-anonymity and differential privacy can be combined to produce microdata releases with the strong privacy guarantees of differential privacy and improved data accuracy. The last contribution delves into the relation between t-closeness and differential privacy. We see that for a specific distance and under some reasonable assumptions on the intruders knowledge, t-closeness leads to differential privacy.