LG · CV · Mar 11, 2025

Birds look like cars: Adversarial analysis of intrinsically interpretable deep learning

arXiv:2503.08636v2 · 4 citations · h-index: 35 · Mach learn
Originality: Incremental advance
AI Analysis

This work highlights critical vulnerabilities in interpretable models, questioning their trustworthiness for applications where reliability is essential. Its contribution is incremental: it exposes specific risks rather than proposing new solutions.

The paper tackles the problem of verifying the robustness and interpretability of intrinsically interpretable deep learning models, showing that they are susceptible to adversarial attacks like prototype manipulation and backdoor attacks, which can fool the model's reasoning and create a false sense of security.

A common belief is that intrinsically interpretable deep learning models ensure a correct, intuitive understanding of their behavior and offer greater robustness against accidental errors or intentional manipulation. However, these beliefs have not been comprehensively verified, and growing evidence casts doubt on them. In this paper, we highlight the risks related to overreliance and susceptibility to adversarial manipulation of these so-called "intrinsically (aka inherently) interpretable" models by design. We introduce two strategies for adversarial analysis with prototype manipulation and backdoor attacks against prototype-based networks, and discuss how concept bottleneck models defend against these attacks. Fooling the model's reasoning by exploiting its use of latent prototypes manifests the inherent uninterpretability of deep neural networks, leading to a false sense of security reinforced by a visual confirmation bias. The reported limitations of part-prototype networks put their trustworthiness and applicability into question, motivating further work on the robustness and alignment of (deep) interpretable models.
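To make the role of latent prototypes concrete, here is a minimal sketch of how a part-prototype classifier might score an input. It is not the paper's code. The function names, the two-dimensional toy latents, and the log-ratio similarity are illustrative assumptions. The point is only that the decision and its "this looks like that" explanation depend on distances in latent space, so any input whose encoding lands near a prototype inherits that prototype's explanation, whatever the pixels show.

```python
import math

def prototype_similarity(patch, prototype, eps=1e-4):
    """Similarity between one latent patch and one prototype.

    Assumes a log((d + 1) / (d + eps)) activation, where d is the squared
    Euclidean distance in latent space (large when the patch is close).
    """
    d = sum((a - b) ** 2 for a, b in zip(patch, prototype))
    return math.log((d + 1.0) / (d + eps))

def prototype_scores(latent_patches, prototypes):
    """For each prototype, keep the best-matching patch (max pooling)."""
    return [
        max(prototype_similarity(p, proto) for p in latent_patches)
        for proto in prototypes
    ]

def classify(latent_patches, prototypes, class_weights):
    """Linear layer over prototype scores; returns (class index, scores)."""
    scores = prototype_scores(latent_patches, prototypes)
    logits = [sum(w * s for w, s in zip(row, scores)) for row in class_weights]
    return logits.index(max(logits)), scores

# Hypothetical toy setup: one "bird" prototype and one "car" prototype.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
class_weights = [[1.0, 0.0], [0.0, 1.0]]  # class 0 = bird, class 1 = car

# Latent patches of a genuine bird image (assumed encoder output).
bird_latent = [[0.9, 0.1], [0.5, 0.5]]

# Latent patches of a manipulated input: whatever the image depicts, one
# patch has been pushed onto the bird prototype in latent space.
manipulated_latent = [[1.0, 0.0], [0.2, 0.3]]

bird_class, bird_scores = classify(bird_latent, prototypes, class_weights)
manip_class, manip_scores = classify(manipulated_latent, prototypes, class_weights)
print(bird_class, manip_class)
```

In this toy setting both inputs are assigned the bird class, and the manipulated input activates the bird prototype even more strongly than the genuine one. The explanation a user would see is driven by the latent match alone.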

