LGNov 12, 2024

Tackling Polysemanticity with Neuron Embeddings

arXiv:2411.08166v11 citationsh-index: 1
Originality Incremental advance
AI Analysis

This addresses the problem of interpreting complex neural network behaviors for researchers, though it appears incremental as it builds on existing methods like Sparse Auto-Encoders.

The paper tackles polysemanticity in neural networks by introducing neuron embeddings, which identify distinct semantic behaviors in a neuron's examples, and demonstrates this on GPT2-small with a UI for exploration.

We present neuron embeddings, a representation that can be used to tackle polysemanticity by identifying the distinct semantic behaviours in a neuron's characteristic dataset examples, making downstream manual or automatic interpretation much easier. We apply our method to GPT2-small, and provide a UI for exploring the results. Neuron embeddings are computed using a model's internal representations and weights, making them domain and architecture agnostic and removing the risk of introducing external structure which may not reflect a model's actual computation. We describe how neuron embeddings can be used to measure neuron polysemanticity, which could be applied to better evaluate the efficacy of Sparse Auto-Encoders (SAEs).

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes