Marginality: a numerical mapping for enhanced treatment of nominal and hierarchical attributes
This work addresses a specific bottleneck in privacy-preserving data mining: researchers and practitioners who handle categorical data have few anonymization methods to choose from. The contribution is an incremental technical improvement.
The paper tackles the limited choice of statistical disclosure control methods for categorical data by introducing a numerical mapping for hierarchical nominal data that enables computation of means, variances, and covariances, thereby enhancing data anonymization capabilities.
The purpose of statistical disclosure control (SDC) of microdata, also known as data anonymization or privacy-preserving data mining, is to publish data sets containing the answers of individual respondents in such a way that the respondents corresponding to the released records cannot be re-identified while the released data remain analytically useful. SDC methods are based on masking the original data, generating synthetic versions of the data, or creating hybrid versions by combining original and synthetic data. The choice of SDC methods for categorical data, especially nominal data, is much smaller than the choice of methods for numerical data. We mitigate this problem by introducing a numerical mapping for hierarchical nominal data which allows computing means, variances and covariances on them.
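The abstract does not spell out the mapping itself, so the following is only a minimal sketch of the general idea under an assumed definition: each nominal value in a sample is scored by the sum of its tree path distances to every value in the sample, so that values far from the bulk of the sample get larger numbers. The toy hierarchy `PARENT`, the helper names `path_to_root`, `tree_distance` and `numeric_scores`, and the choice of path length as the distance are all illustrative assumptions, not the paper's definitions. Once every value has a number, ordinary means and variances can be computed.

```python
from statistics import mean, pvariance

# Hypothetical toy hierarchy (child -> parent); not taken from the paper.
PARENT = {
    "flu": "respiratory",
    "asthma": "respiratory",
    "respiratory": "disease",
    "angina": "cardiac",
    "cardiac": "disease",
}

def path_to_root(node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def tree_distance(a, b):
    """Number of edges on the path between a and b in the hierarchy."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    steps_in_b = {n: i for i, n in enumerate(path_b)}
    for steps_in_a, n in enumerate(path_a):
        if n in steps_in_b:  # first common ancestor
            return steps_in_a + steps_in_b[n]
    raise ValueError("values do not share a common ancestor")

def numeric_scores(sample):
    """Assumed mapping: sum of distances from each value to all sample values."""
    return [sum(tree_distance(v, w) for w in sample) for v in sample]

sample = ["flu", "flu", "asthma", "angina"]
scores = numeric_scores(sample)

print(scores)             # numeric stand-ins for the nominal values
print(mean(scores))       # mean of the mapped attribute
print(pvariance(scores))  # population variance of the mapped attribute
```

In this toy sample, "angina" sits in a different branch from the other three values and therefore receives the largest score. A covariance between two mapped attributes could be computed in the same way from their score lists.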