ML · LG · PR · Sep 26, 2025

Concept activation vectors: a unifying view and adversarial attacks

arXiv:2509.22755v1 · 1 citation · h-index: 4
Originality: Incremental advance
AI Analysis

This work addresses a critical issue in explainable AI for researchers and practitioners by exposing a previously overlooked vulnerability in CAV methods, though it is incremental in refining existing tools.

The paper tackles the problem of understanding how human-interpretable concepts are encoded in model latent spaces using Concept Activation Vectors (CAVs), and it reveals a vulnerability where CAVs depend on arbitrary non-concept distributions, enabling adversarial attacks.

Concept Activation Vectors (CAVs) are a tool from explainable AI, offering a promising approach for understanding how human-understandable concepts are encoded in a model's latent spaces. They are computed from hidden-layer activations of two sets of inputs: examples of a concept class and non-concept examples. From a probabilistic perspective, the distribution of the (non-)concept inputs induces a distribution over the CAV, making it a random vector in the latent space. This enables us to derive the mean and covariance for different types of CAVs, leading to a unified theoretical view. The same perspective also reveals a potential vulnerability: CAVs can depend strongly on the rather arbitrary non-concept distribution, a factor largely overlooked in prior work. We illustrate this with a simple yet effective adversarial attack, underscoring the need for a more systematic study.
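
To make the dependence on the non-concept distribution concrete, here is a minimal sketch in Python. It is not the paper's method. It assumes one common construction, the normalized difference between the mean concept activation and the mean non-concept activation, and it replaces real hidden-layer activations with synthetic 3-dimensional Gaussian samples. The names `sample_activations`, `mean_difference_cav`, and `cosine` are hypothetical helpers introduced for illustration only.

```python
import math
import random


def sample_activations(rng, mean, n, std=1.0):
    """Draw n synthetic 'activations' from an isotropic Gaussian around `mean`.

    Assumption: these samples stand in for hidden-layer activations of a model.
    """
    return [[rng.gauss(m, std) for m in mean] for _ in range(n)]


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def mean_difference_cav(concept_acts, non_concept_acts):
    """Unit-norm difference between the concept and non-concept activation means.

    Assumption: this is one simple CAV construction, chosen for illustration.
    """
    mu_concept = mean_vector(concept_acts)
    mu_non_concept = mean_vector(non_concept_acts)
    diff = [c - n for c, n in zip(mu_concept, mu_non_concept)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# The concept activations are the same in both experiments.
concept = sample_activations(rng, [2.0, 0.0, 0.0], 500)

# Two different, equally "arbitrary" choices of non-concept examples.
non_concept_a = sample_activations(rng, [0.0, 0.0, 0.0], 500)
non_concept_b = sample_activations(rng, [2.0, -2.0, 0.0], 500)

cav_a = mean_difference_cav(concept, non_concept_a)
cav_b = mean_difference_cav(concept, non_concept_b)
similarity = cosine(cav_a, cav_b)

print("CAV with non-concept set A:", [round(x, 2) for x in cav_a])
print("CAV with non-concept set B:", [round(x, 2) for x in cav_b])
print("cosine similarity:", round(similarity, 2))
```

With the concept samples held fixed, the first CAV points mostly along the first axis and the second mostly along the second axis, so their cosine similarity is close to zero. Because each CAV is computed from random samples, rerunning with a different seed gives slightly different vectors, which is the sense in which the CAV is a random vector with its own mean and covariance.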

