CV | AI | LG | Mar 7, 2025

Post-Hoc Concept Disentanglement: From Correlated to Isolated Concept Representations

arXiv:2503.05522v1 | 1 citation | h-index: 32 | xAI
Originality: Incremental advance
AI Analysis

This paper addresses concept entanglement in interpretability methods for neural networks, offering a solution for researchers and practitioners in explainable AI. The advance is incremental, as it builds on the existing CAV framework.

The paper tackled the problem of correlated concepts producing entangled, non-orthogonal directions in Concept Activation Vectors (CAVs), which complicates interpretation and applications like activation steering. The result was a post-hoc disentanglement method that improved concept isolation, demonstrated through tasks such as inserting isolated concepts into images and suppressing shortcuts with reduced impact on correlated concepts in datasets like CelebA and FunnyBirds.

Concept Activation Vectors (CAVs) are widely used to model human-understandable concepts as directions within the latent space of neural networks. They are trained by identifying directions from the activations of concept samples to those of non-concept samples. However, this method often produces similar, non-orthogonal directions for correlated concepts, such as "beard" and "necktie" within the CelebA dataset, which frequently co-occur in images of men. This entanglement complicates the interpretation of concepts in isolation and can lead to undesired effects in CAV applications, such as activation steering. To address this issue, we introduce a post-hoc concept disentanglement method that employs a non-orthogonality loss, facilitating the identification of orthogonal concept directions while preserving directional correctness. We evaluate our approach with real-world and controlled correlated concepts in CelebA and a synthetic FunnyBirds dataset with VGG16 and ResNet18 architectures. We further demonstrate the superiority of orthogonalized concept representations in activation steering tasks, allowing (1) the insertion of isolated concepts into input images through generative models and (2) the removal of concepts for effective shortcut suppression with reduced impact on correlated concepts in comparison to baseline CAVs.
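To make the idea in the abstract concrete, here is a minimal sketch of post-hoc disentanglement for a set of already-trained CAVs. The abstract does not give the exact loss, so everything here is an assumption for illustration: the CAVs are fitted as mean differences, the non-orthogonality term is the sum of squared pairwise dot products between unit-norm directions, and "directional correctness" is approximated by an alignment term that keeps each direction close to its original CAV. The function names and the hyperparameters `alpha`, `lr`, and `steps` are hypothetical and are not taken from the paper or its repository.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def mean_difference_cav(concept_acts, other_acts):
    """Unit direction from the non-concept mean to the concept mean (assumed CAV type)."""
    d = len(concept_acts[0])
    mu_pos = [sum(a[i] for a in concept_acts) / len(concept_acts) for i in range(d)]
    mu_neg = [sum(a[i] for a in other_acts) / len(other_acts) for i in range(d)]
    return normalize([p - n for p, n in zip(mu_pos, mu_neg)])

def disentangle(cavs, alpha=0.1, lr=0.05, steps=500):
    """Projected gradient descent on the assumed loss
         sum_{i != j} (v_i . v_j)^2  +  alpha * sum_i (1 - v_i . c_i),
    where c_i are the original unit-norm CAVs and each v_i is
    renormalized to unit length after every step."""
    refs = [normalize(c) for c in cavs]
    vs = [list(c) for c in refs]
    for _ in range(steps):
        new_vs = []
        for i, v in enumerate(vs):
            # Gradient of the alignment term: -alpha * c_i
            grad = [-alpha * r for r in refs[i]]
            # Gradient of the non-orthogonality term: 4 * sum_{j != i} (v_i . v_j) v_j
            for j, u in enumerate(vs):
                if j != i:
                    s = 4.0 * dot(v, u)
                    grad = [g + s * x for g, x in zip(grad, u)]
            new_vs.append(normalize([x - lr * g for x, g in zip(v, grad)]))
        vs = new_vs
    return vs

# Toy example: two strongly correlated concept directions in a 3-d latent space.
cav_a = normalize([1.0, 0.8, 0.0])
cav_b = normalize([0.8, 1.0, 0.0])
cos_before = dot(cav_a, cav_b)

new_a, new_b = disentangle([cav_a, cav_b])
cos_after = dot(new_a, new_b)
alignment_a = dot(new_a, cav_a)

print("cosine before:", round(cos_before, 3))
print("cosine after: ", round(cos_after, 3))
print("alignment of concept A with its original CAV:", round(alignment_a, 3))
```

The weight `alpha` sets the trade-off in this sketch: a larger value keeps each direction closer to its original CAV at the cost of leaving more residual overlap between the concepts.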

Code Implementations: 1 repo