Concept activation vectors: a unifying view and adversarial attacks
This work matters to explainable-AI researchers and practitioners because it exposes a previously overlooked vulnerability in CAV methods, although the contribution is incremental, refining existing tools.
The paper studies how human-interpretable concepts are encoded in a model's latent space using Concept Activation Vectors (CAVs), and it shows that CAVs depend on the arbitrary choice of non-concept distribution, a dependence that enables adversarial attacks.
Concept Activation Vectors (CAVs) are a tool from explainable AI, offering a promising approach for understanding how human-understandable concepts are encoded in a model's latent spaces. They are computed from hidden-layer activations of inputs belonging either to a concept class or to non-concept examples. Adopting a probabilistic perspective, the distribution of the (non-)concept inputs induces a distribution over the CAV, making it a random vector in the latent space. This enables us to derive mean and covariance for different types of CAVs, leading to a unified theoretical view. This probabilistic perspective also reveals a potential vulnerability: CAVs can strongly depend on the rather arbitrary non-concept distribution, a factor largely overlooked in prior work. We illustrate this with a simple yet effective adversarial attack, underscoring the need for a more systematic study.
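The two ideas in the abstract, a CAV as a random vector and its dependence on the non-concept distribution, can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the activations are synthetic Gaussians rather than hidden-layer outputs of a real model, the CAV is the simple difference of class means (one common variant, not necessarily the types analysed in the paper), and the names `mean_difference_cav`, `sample_concept`, and `sample_non_concept` are hypothetical. The shifted non-concept set is only a toy stand-in for an attack, not the paper's attack.

```python
import numpy as np


def mean_difference_cav(concept_acts, non_concept_acts):
    """Difference-of-means CAV: mean concept activation minus mean non-concept activation."""
    return concept_acts.mean(axis=0) - non_concept_acts.mean(axis=0)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


rng = np.random.default_rng(0)
dim, n = 8, 500  # latent dimension and samples per class (arbitrary choices)


def sample_concept():
    # Synthetic stand-in for hidden-layer activations of concept inputs.
    return rng.normal(loc=1.0, scale=1.0, size=(n, dim))


def sample_non_concept():
    # Synthetic stand-in for hidden-layer activations of non-concept inputs.
    return rng.normal(loc=0.0, scale=1.0, size=(n, dim))


# 1) The CAV as a random vector: redraw both sample sets many times and
#    estimate the mean and covariance of the resulting CAVs empirically.
cavs = np.stack(
    [mean_difference_cav(sample_concept(), sample_non_concept()) for _ in range(200)]
)
cav_mean = cavs.mean(axis=0)          # close to the true mean difference (all ones)
cav_cov = np.cov(cavs, rowvar=False)  # close to (1/n + 1/n) * identity for this setup

# 2) Dependence on the non-concept distribution: keep the concept set fixed
#    and move the non-concept set along one latent axis.
concept_acts = sample_concept()
non_concept_a = sample_non_concept()
shift = np.zeros(dim)
shift[0] = 5.0
non_concept_b = non_concept_a + shift  # same samples, shifted mean

cav_a = mean_difference_cav(concept_acts, non_concept_a)
cav_b = mean_difference_cav(concept_acts, non_concept_b)
similarity = cosine(cav_a, cav_b)

print("empirical CAV mean:", np.round(cav_mean, 2))
print("trace of empirical CAV covariance:", float(np.trace(cav_cov)))
print("cosine similarity between the two CAVs:", similarity)
```

For the difference-of-means variant with independent samples, the covariance of the CAV is the sum of the two class covariances, each divided by its sample size, which is what the empirical estimate in step 1 approximates. Step 2 shows that the concept data never changes, yet the CAV direction rotates substantially once the non-concept set is moved.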