LG | CR | Feb 9, 2021

$k$-Anonymity in Practice: How Generalisation and Suppression Affect Machine Learning Classifiers

arXiv:2102.04763v2 | 55 citations
AI Analysis

This work addresses a practical problem for data practitioners and researchers: balancing the privacy provided by $k$-anonymity against the utility of machine learning models trained on the anonymised data.

This paper systematically investigates how different k-anonymisation algorithms impact the performance of machine learning classifiers. It finds that classification performance generally degrades with stronger k-anonymity constraints, but the extent varies significantly by dataset and anonymisation method, with Mondrian showing the most promising properties for subsequent classification.

The protection of private information is a crucial issue in data-driven research and business contexts. Typically, techniques like anonymisation or (selective) deletion are introduced in order to allow data sharing, e.g. in the case of collaborative research endeavours. For use with anonymisation techniques, the $k$-anonymity criterion is one of the most popular, with numerous scientific publications on different algorithms and metrics. Anonymisation techniques often require changing the data and thus necessarily affect the results of machine learning models trained on the underlying data. In this work, we conduct a systematic comparison and detailed investigation into the effects of different $k$-anonymisation algorithms on the results of machine learning models. We investigate a set of popular $k$-anonymisation algorithms with different classifiers and evaluate them on different real-world datasets. Our systematic evaluation shows that with an increasingly strong $k$-anonymity constraint, the classification performance generally degrades, but to varying degrees and strongly depending on the dataset and anonymisation method. Furthermore, Mondrian can be considered as the method with the most appealing properties for subsequent classification.
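
To make the $k$-anonymity criterion and the two operations named in the title concrete, the following is a minimal toy sketch, not one of the algorithms evaluated in the paper (it is not Mondrian, for instance). The records, the choice of `age` and `zip` as quasi-identifiers, and the helper names `is_k_anonymous`, `generalise`, and `suppress` are all assumptions made for illustration. A table is treated as $k$-anonymous when every combination of quasi-identifier values is shared by at least $k$ records.

```python
from collections import Counter

# Hypothetical toy records; "age" and "zip" are assumed quasi-identifiers,
# "label" stands in for the classification target.
RECORDS = [
    {"age": 23, "zip": "10115", "label": 0},
    {"age": 27, "zip": "10117", "label": 1},
    {"age": 34, "zip": "10115", "label": 0},
    {"age": 36, "zip": "10119", "label": 1},
    {"age": 61, "zip": "20095", "label": 1},
]
QI = ("age", "zip")


def group_sizes(records, qi=QI):
    """Count how many records share each quasi-identifier combination."""
    return Counter(tuple(r[a] for a in qi) for r in records)


def is_k_anonymous(records, k, qi=QI):
    """True if every quasi-identifier combination occurs at least k times."""
    return all(n >= k for n in group_sizes(records, qi).values())


def generalise(record, age_width=10, zip_digits=3):
    """Generalisation: coarsen age to an interval, mask trailing zip digits."""
    low = (record["age"] // age_width) * age_width
    out = dict(record)
    out["age"] = f"{low}-{low + age_width - 1}"
    masked = len(record["zip"]) - zip_digits
    out["zip"] = record["zip"][:zip_digits] + "*" * masked
    return out


def suppress(records, k, qi=QI):
    """Suppression: drop records whose group is still smaller than k."""
    sizes = group_sizes(records, qi)
    return [r for r in records if sizes[tuple(r[a] for a in qi)] >= k]


generalised = [generalise(r) for r in RECORDS]
anonymised = suppress(generalised, k=2)

print(is_k_anonymous(RECORDS, 2))      # raw data: every record is unique
print(is_k_anonymous(generalised, 2))  # one record is still alone in its group
print(is_k_anonymous(anonymised, 2))   # holds after suppressing that record
```

In this toy setting, generalisation replaces exact values with coarser ones and suppression removes the remaining outlier record, so a classifier trained on `anonymised` sees both less precise features and fewer rows. That is the kind of information loss whose effect on classification the paper measures.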

Code Implementations: 1 repo