CV · AI · LG · Apr 9, 2024

PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits

arXiv:2404.06453v1 · 28 citations · h-index: 32 · Has Code · XAI4CV
Originality: Incremental advance
AI Analysis

This work addresses an interpretability challenge for researchers in mechanistic interpretability by offering an approach to disentangling features. The advance is incremental, since it builds on existing circuit-based methods.

The paper tackles the problem of polysemantic neurons in deep neural networks: single neurons that encode multiple unrelated features and thereby hinder interpretation. It presents a method that decomposes such a neuron into multiple monosemantic "virtual" neurons by identifying the circuit relevant to each feature. Evaluated with CLIP on ResNet models trained on ImageNet, the method disentangles representations better than approaches based on neuron activations.

The field of mechanistic interpretability aims to study the role of individual neurons in Deep Neural Networks. Single neurons, however, have the capability to act polysemantically and encode for multiple (unrelated) features, which renders their interpretation difficult. We present a method for disentangling polysemanticity of any Deep Neural Network by decomposing a polysemantic neuron into multiple monosemantic "virtual" neurons. This is achieved by identifying the relevant sub-graph ("circuit") for each "pure" feature. We demonstrate how our approach allows us to find and disentangle various polysemantic units of ResNet models trained on ImageNet. While evaluating feature visualizations using CLIP, our method effectively disentangles representations, improving upon methods based on neuron activations. Our code is available at https://github.com/maxdreyer/PURE.

Code Implementations: 1 repo