LG, CV · Jan 26, 2025

Making Sense Of Distributed Representations With Activation Spectroscopy

arXiv:2501.15435v1 · h-index: 50
Originality: Incremental advance
AI Analysis

This work addresses interpretability challenges in deep learning, offering researchers and practitioners a new way to analyze distributed features, though it appears incremental because it extends an existing algorithm (Goldreich-Levin).

The paper tackles the problem of interpreting distributed representations in neural networks by proposing Activation Spectroscopy (ActSpec), a method that analyzes a layer's activation patterns through their pseudo-Boolean Fourier spectrum to detect subsets of neurons and trace their joint influence. It is validated experimentally in synthetic settings, on an MNIST classifier, and on a transformer-based network for sentiment analysis.

In the study of neural network interpretability, there is growing evidence to suggest that relevant features are encoded across many neurons in a distributed fashion. Making sense of these distributed representations without knowledge of the network's encoding strategy is a combinatorial task that is not guaranteed to be tractable. This work explores one feasible path to both detecting and tracing the joint influence of neurons in a distributed representation. We term this approach Activation Spectroscopy (ActSpec), owing to its analysis of the pseudo-Boolean Fourier spectrum defined over the activation patterns of a network layer. The sub-network defined between a given layer and an output logit is cast as a special class of pseudo-Boolean function. The contributions of each subset of neurons in the specified layer can be quantified through the function's Fourier coefficients. We propose a combinatorial optimization procedure to search for Fourier coefficients that are simultaneously high-valued and non-redundant. This procedure can be viewed as an extension of the Goldreich-Levin algorithm that incorporates additional problem-specific constraints. The resulting coefficients specify a collection of subsets, which are used to test the degree to which a representation is distributed. We verify our approach in a number of synthetic settings and compare against existing interpretability benchmarks. We conclude with a number of experimental evaluations on an MNIST classifier and a transformer-based network for sentiment analysis.
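To make the idea of per-subset Fourier coefficients concrete, here is a minimal sketch, not the paper's procedure. It assumes each neuron's activation pattern is encoded as -1 (inactive) or +1 (active), and it uses the standard definition of the Fourier coefficient of a pseudo-Boolean function f on a subset S: the average over all inputs x of f(x) times the product of x_i for i in S. The names `fourier_coefficients` and `toy_logit` are hypothetical, and the brute-force enumeration over all 2^n patterns is only workable for very small n; the paper's Goldreich-Levin-style search is what avoids that enumeration and is not reproduced here.

```python
import itertools
import numpy as np


def fourier_coefficients(f, n):
    """Brute-force Fourier coefficients of f: {-1,+1}^n -> R.

    Returns a dict mapping each subset S (a tuple of neuron indices) to
    f_hat(S) = mean over all x of f(x) * prod_{i in S} x_i.
    Cost is exponential in n, so this is for illustration only.
    """
    points = np.array(list(itertools.product([-1, 1], repeat=n)))
    values = np.array([f(x) for x in points], dtype=float)
    coeffs = {}
    for k in range(n + 1):
        for S in itertools.combinations(range(n), k):
            # Parity character chi_S(x): product of the coordinates in S
            # (all ones for the empty subset).
            chi = np.ones(len(points))
            for i in S:
                chi = chi * points[:, i]
            coeffs[S] = float(np.mean(values * chi))
    return coeffs


def toy_logit(x):
    # Hypothetical stand-in for "layer -> output logit": neurons 0 and 1
    # only matter jointly, neuron 2 matters alone, neuron 3 is ignored.
    return 2.0 * x[0] * x[1] + 0.5 * x[2]


coeffs = fourier_coefficients(toy_logit, n=4)

# The largest-magnitude coefficients identify the influential subsets.
top = sorted(coeffs.items(), key=lambda kv: -abs(kv[1]))[:2]
print(top)
```

In this toy case the coefficient on the subset {0, 1} is large while the coefficients on {0} and {1} alone are zero, which is the kind of signature a distributed (jointly encoded) feature would leave in the spectrum.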

