Explaining the Explainer: A First Theoretical Analysis of LIME
This work addresses the need for interpretability in machine learning for sensitive applications. It offers foundational insights into LIME's behavior, though the contribution is incremental in that it analyzes an existing method rather than proposing a new one.
The paper tackles the lack of theoretical understanding of LIME, a popular interpretability algorithm, by providing its first theoretical analysis. For linear functions, it shows that LIME's coefficients are proportional to the gradient, which confirms that LIME discovers meaningful features. It also shows that poor parameter choices can cause LIME to miss important features.
Machine learning is used more and more often for sensitive applications, sometimes replacing humans in critical decision-making processes. As such, interpretability of these algorithms is a pressing need. One popular algorithm to provide interpretability is LIME (Local Interpretable Model-Agnostic Explanation). In this paper, we provide the first theoretical analysis of LIME. We derive closed-form expressions for the coefficients of the interpretable model when the function to explain is linear. The good news is that these coefficients are proportional to the gradient of the function to explain: LIME indeed discovers meaningful features. However, our analysis also reveals that poor choices of parameters can lead LIME to miss important features.
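The proportionality claim can be checked numerically on a toy problem. The sketch below is a simplified LIME-style surrogate, not the `lime` package and not necessarily the exact setting analyzed in the paper. It assumes standard Gaussian perturbations, quartile bins of the standard normal, binary interpretable features that indicate whether a perturbed sample falls in the same bin as the point to explain, an exponential kernel on those binary features, and unregularized weighted least squares. The function name `lime_style_coefficients` and the parameters `nu`, `n_samples`, and `seed` are hypothetical and chosen for illustration only.

```python
import numpy as np


def lime_style_coefficients(a, b, xi, n_samples=50_000, nu=1.0, seed=0):
    """Fit a simplified LIME-style surrogate to the linear f(x) = a.x + b.

    Assumptions (illustrative, not the lime package): N(0, 1) perturbations,
    quartile bins, exponential kernel with bandwidth nu, plain weighted
    least squares with an intercept.
    """
    rng = np.random.default_rng(seed)
    d = len(a)

    # Perturbed samples drawn independently per coordinate.
    x = rng.standard_normal((n_samples, d))

    # Approximate quartile boundaries of the standard normal.
    edges = np.array([-0.6745, 0.0, 0.6745])

    # Binary interpretable features: 1 if the sample shares xi's bin.
    z = (np.digitize(x, edges) == np.digitize(xi, edges)).astype(float)

    # Responses of the linear function to explain.
    y = x @ a + b

    # Exponential kernel on the distance between z and the all-ones vector.
    w = np.exp(-np.sum((1.0 - z) ** 2, axis=1) / (2.0 * nu**2))

    # Weighted least squares via rescaling rows by sqrt(weight).
    design = np.column_stack([np.ones(n_samples), z])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
    return coef[1:]  # drop the intercept


a = np.array([2.0, -1.0, 3.0])   # gradient of f
xi = np.array([1.0, 1.0, 1.0])   # point to explain, every coordinate in the top bin
beta = lime_style_coefficients(a, b=0.5, xi=xi)
ratios = beta / a                # roughly constant if beta is proportional to a
print(beta, ratios)
```

In this toy setting every coordinate of `xi` lies in the same bin, so the ratios `beta / a` should be approximately equal across coordinates. If the coordinates of `xi` fall in different bins, the per-coordinate factor can differ, so the sketch illustrates the proportionality only under the stated assumptions.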