To Trust or Not To Trust Prediction Scores for Membership Inference Attacks
This work addresses privacy concerns in machine learning by re-evaluating the effectiveness of membership inference attacks, offering incremental insights for security researchers and practitioners.
The paper challenges the assumption that prediction scores reliably indicate training data membership in membership inference attacks (MIAs). It shows that overconfidence in modern deep networks leads to high false-positive rates, on known domains as well as on out-of-distribution data, and thereby implicitly acts as a defense against MIAs; generative adversarial networks can produce a potentially infinite number of samples that are falsely classified as training data. It also reveals a trade-off: the more a classifier makes low-confidence predictions on inputs it does not know, the more it reveals about its training data. The authors conclude that the threat of MIAs is overestimated and that less information is leaked than previously assumed.
Membership inference attacks (MIAs) aim to determine whether a specific sample was used to train a predictive model. Knowing this may indeed lead to a privacy breach. Most MIAs, however, make use of the model's prediction scores - the probability of each output given some input - following the intuition that the trained model tends to behave differently on its training data. We argue that this is a fallacy for many modern deep network architectures. Consequently, MIAs will miserably fail since overconfidence leads to high false-positive rates not only on known domains but also on out-of-distribution data and implicitly acts as a defense against MIAs. Specifically, using generative adversarial networks, we are able to produce a potentially infinite number of samples falsely classified as part of the training data. In other words, the threat of MIAs is overestimated, and less information is leaked than previously assumed. Moreover, there is actually a trade-off between the overconfidence of models and their susceptibility to MIAs: the more classifiers know when they do not know, making low confidence predictions, the more they reveal the training data.
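The false-positive argument can be illustrated with a minimal sketch of a score-based attack. This is not the paper's implementation: it assumes a simple attack that flags a sample as a training member whenever the maximum softmax score reaches a fixed threshold, and the function names, the threshold of 0.9, and all logit values are hypothetical, chosen only for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_member(logits, threshold=0.9):
    """Assumed threshold attack: flag a sample as a training member
    if its maximum prediction score reaches the threshold."""
    return max(softmax(logits)) >= threshold

def false_positive_rate(non_member_logits, threshold=0.9):
    """Fraction of known non-members that the attack flags as members."""
    flagged = sum(predict_member(z, threshold) for z in non_member_logits)
    return flagged / len(non_member_logits)

# Hypothetical logits for samples that were NOT part of the training data.
# An overconfident model assigns near-certain scores even to unseen inputs.
overconfident = [[9.0, 0.5, 0.1], [0.2, 8.0, 0.3], [7.5, 0.0, 0.4], [0.1, 0.3, 6.0]]
# A model that "knows when it does not know" gives flatter scores on them.
calibrated = [[1.2, 0.9, 0.7], [0.5, 1.4, 0.8], [2.0, 0.3, 0.1], [0.6, 0.7, 1.1]]

fpr_overconfident = false_positive_rate(overconfident)
fpr_calibrated = false_positive_rate(calibrated)
print(fpr_overconfident, fpr_calibrated)
```

With these made-up values, every non-member of the overconfident model is flagged as a member, while none is flagged for the flatter-scoring model. This mirrors the trade-off described above: when a model makes low-confidence predictions on unseen inputs, a high score becomes a more informative membership signal.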