LG AI CR Aug 12, 2024

Fooling SHAP with Output Shuffling Attacks

arXiv:2408.06509v1 6.43 citations h-index: 20

Originality: Incremental advance

AI Analysis

This work exposes a vulnerability in SHAP-based fairness auditing that matters to practitioners. The advance is incremental: it builds on prior adversarial attacks against explanation methods and relaxes their requirement of access to the underlying data distribution.

The paper studies adversarial attacks on SHAP-based explainable AI methods. These attacks hide unfair model behavior by manipulating feature attributions, and they do not require access to the data distribution. The paper proves that Shapley values themselves cannot detect these attacks, and shows on real-world datasets that practical estimation algorithms such as linear SHAP and SHAP can detect them with varying effectiveness.

Explainable AI (XAI) methods such as SHAP can help discover feature attributions in black-box models. If the method reveals a significant attribution from a "protected feature" (e.g., gender, race) on the model output, the model is considered unfair. However, adversarial attacks can subvert the detection of XAI methods. Previous approaches to constructing such an adversarial model require access to underlying data distribution, which may not be possible in many practical scenarios. We relax this constraint and propose a novel family of attacks, called shuffling attacks, that are data-agnostic. The proposed attack strategies can adapt any trained machine learning model to fool Shapley value-based explanations. We prove that Shapley values cannot detect shuffling attacks. However, algorithms that estimate Shapley values, such as linear SHAP and SHAP, can detect these attacks with varying degrees of effectiveness. We demonstrate the efficacy of the attack strategies by comparing the performance of linear SHAP and SHAP using real-world datasets.
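To make concrete what "a significant attribution from a protected feature" means, the following is a minimal sketch of exact Shapley values for a toy model. It is not the paper's shuffling attack and not the SHAP library's implementation. The names `shapley_values` and `toy_model`, the weights, and the background data are all hypothetical, and the value function (averaging the model over background rows for features outside the coalition) is one common choice assumed here for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, background):
    """Exact Shapley values of f at x by enumerating all coalitions.

    Assumption: features outside a coalition are replaced by values from
    each background row, and the model output is averaged over those rows.
    """
    n = len(x)

    def value(subset):
        total = 0.0
        for b in background:
            z = [x[j] if j in subset else b[j] for j in range(n)]
            total += f(z)
        return total / len(background)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contrib = 0.0
        for k in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for S in combinations(others, k):
                s = set(S)
                contrib += weight * (value(s | {i}) - value(s))
        phi.append(contrib)
    return phi


# Hypothetical toy model: feature 0 plays the role of the protected feature.
def toy_model(z):
    return 2.0 * z[0] + 1.0 * z[1] - 0.5 * z[2]


background = [[0, 1, 2], [1, 3, 0], [0, 2, 4], [1, 0, 2]]
x = [1, 4, 1]
phi = shapley_values(toy_model, x, background)
print(phi)  # attribution per feature; phi[0] belongs to the protected feature
```

In this sketch a nonzero `phi[0]` is the kind of signal an auditor would read as dependence on the protected feature. For a linear model under this value function, each attribution equals the weight times the feature's deviation from its background mean, and the attributions sum to the model output at `x` minus the mean output over the background.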
