Improving LIME Robustness with Smarter Locality Sampling
This work addresses a robustness weakness in explainable AI for commercial applications. The contribution is incremental, since it builds on the existing LIME method rather than proposing a new explainer.
The paper tackles LIME's vulnerability to adversarial exploitation, which stems from its naive sampling strategy. The resulting method detects biased behavior more accurately than vanilla LIME across three real-world datasets while maintaining comparable explanation quality, reaching up to 99.94% top-1 accuracy in some cases.
Explainability algorithms such as LIME have helped make machine learning systems more transparent and fair, qualities that are important in commercial use cases. However, recent work has shown that an adversary can exploit LIME's naive sampling strategy to conceal biased, harmful behavior. We propose to make LIME more robust by training a generative adversarial network to sample more realistic synthetic data, which the explainer then uses to generate explanations. Our experiments show that the proposed method detects biased, adversarial behavior more accurately than vanilla LIME across three real-world datasets. This is achieved while maintaining comparable explanation quality, with up to 99.94% top-1 accuracy in some cases.
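To make the role of the sampling step concrete, the following is a minimal sketch of a LIME-style local surrogate in which the neighborhood sampler is a pluggable function. It is an illustration under stated assumptions, not the paper's implementation: the names `explain_locally`, `gaussian_sampler`, `make_generator_sampler`, and `toy_generator` are hypothetical, the exponential kernel and ridge fit are simplified stand-ins for what a LIME implementation does, and `toy_generator` is an untrained placeholder where a trained GAN generator would go.

```python
import numpy as np

def explain_locally(predict_fn, x, sampler, n_samples=500,
                    kernel_width=0.75, ridge=1e-3, seed=0):
    """Fit a proximity-weighted linear surrogate around x.

    Returns one coefficient per feature (the intercept is dropped).
    """
    rng = np.random.default_rng(seed)
    Z = sampler(x, n_samples, rng)                 # neighbors of x, shape (n, d)
    y = predict_fn(Z)                              # black-box outputs, shape (n,)
    dist = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(dist ** 2) / kernel_width ** 2)   # closer neighbors weigh more
    A = np.hstack([Z, np.ones((len(Z), 1))])       # append an intercept column
    reg = ridge * np.eye(A.shape[1])
    reg[-1, -1] = 0.0                              # leave the intercept unpenalized
    coef = np.linalg.solve(A.T @ (w[:, None] * A) + reg, A.T @ (w * y))
    return coef[:-1]

def gaussian_sampler(x, n, rng, scale=1.0):
    """Naive baseline (assumption): independent Gaussian noise around x."""
    return x + scale * rng.normal(size=(n, x.shape[0]))

def make_generator_sampler(generator, latent_dim):
    """Wrap a generator(z, x) -> samples callable as a sampler.

    In the proposed setting this would be a trained GAN generator;
    here it is any callable with that (hypothetical) signature.
    """
    def sampler(x, n, rng):
        z = rng.normal(size=(n, latent_dim))
        return generator(z, x)
    return sampler

def toy_generator(z, x):
    """Untrained placeholder: small perturbations that stay close to x."""
    return x + 0.1 * z[:, : x.shape[0]]

# Usage: the same explainer, two different sampling strategies.
black_box = lambda Z: 3.0 * Z[:, 0] - 2.0 * Z[:, 1]
x = np.array([0.5, -1.0, 2.0])
print(explain_locally(black_box, x, gaussian_sampler))
print(explain_locally(black_box, x, make_generator_sampler(toy_generator, 3)))
```

The design point is that the explainer never needs to know how its neighbors were produced. An adversarial model can behave differently on unrealistic perturbations than on real data, so swapping the naive sampler for one that produces realistic samples changes what the surrogate observes without changing the fitting step.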