CV · LG · Apr 17, 2025

Disentangling Polysemantic Channels in Convolutional Neural Networks

arXiv:2504.12939v1 · 4 citations · h-index: 4 · Has Code · 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
Originality: Incremental advance
AI Analysis

This work addresses interpretability challenges faced by researchers and practitioners in mechanistic interpretability. It is an incremental advance, building on existing methods for analyzing CNN components.

The paper tackles the problem of polysemantic channels in convolutional neural networks (CNNs): channels that encode multiple concepts and therefore hinder interpretability. It proposes an algorithm that disentangles such channels into single-concept channels, which makes the network easier to interpret and improves feature visualizations.

Mechanistic interpretability is concerned with analyzing individual components in a (convolutional) neural network (CNN) and how they form larger circuits representing decision mechanisms. These investigations are challenging since CNNs frequently learn polysemantic channels that encode distinct concepts, making them hard to interpret. To address this, we propose an algorithm to disentangle a specific kind of polysemantic channel into multiple channels, each responding to a single concept. Our approach restructures weights in a CNN, exploiting the fact that different concepts within the same channel exhibit distinct activation patterns in the previous layer. By disentangling these polysemantic features, we enhance the interpretability of CNNs, ultimately improving explanatory techniques such as feature visualizations.
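
To make the weight-restructuring idea concrete, here is a minimal NumPy sketch. It is not the paper's algorithm. It assumes a simplified setting: a 1x1 convolution written as a matrix product, a ReLU nonlinearity, and input-channel groups per concept that are already known. The names `split_channel`, `forward`, and `groups` are hypothetical and chosen only for this illustration. One polysemantic channel is replaced by one channel per concept, and its outgoing weights are copied to each new channel.

```python
import numpy as np

def split_channel(W_in, W_out, channel, groups):
    """Split one channel into len(groups) channels (illustrative assumption).

    W_in:   (C_out, C_in) weights of the layer holding the polysemantic channel
    W_out:  (C_next, C_out) weights of the following layer
    groups: list of input-channel index lists, one per concept
    """
    new_rows = []
    for idx in groups:
        # Keep only the incoming weights tied to this concept's input channels.
        row = np.zeros_like(W_in[channel])
        row[idx] = W_in[channel, idx]
        new_rows.append(row)
    # Drop the original channel and append the concept-specific channels.
    W_in_new = np.vstack([np.delete(W_in, channel, axis=0), *new_rows])
    # Each new channel inherits the outgoing weights of the original channel.
    col = W_out[:, channel:channel + 1]
    W_out_new = np.hstack([np.delete(W_out, channel, axis=1)] + [col] * len(groups))
    return W_in_new, W_out_new

def forward(W_in, W_out, x):
    """Two-layer forward pass with a ReLU in between."""
    return W_out @ np.maximum(W_in @ x, 0.0)
```

In this sketch the restructured network reproduces the original output whenever only one concept's input channels are active, because the other new channel then has a pre-activation of zero. When several concepts are active at once the outputs can differ, which is one reason this should be read as an illustration under stated assumptions rather than a faithful implementation.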

Code Implementations: 1 repo