LGNov 12, 2024

Tackling Polysemanticity with Neuron Embeddings

arXiv:2411.08166v14.61 citationsh-index: 1

Originality Incremental advance

AI Analysis

This addresses the problem of interpreting complex neural network behaviors for researchers, though it appears incremental as it builds on existing methods like Sparse Auto-Encoders.

The paper tackles polysemanticity in neural networks by introducing neuron embeddings, which identify distinct semantic behaviors in a neuron's examples, and demonstrates this on GPT2-small with a UI for exploration.

We present neuron embeddings, a representation that can be used to tackle polysemanticity by identifying the distinct semantic behaviours in a neuron's characteristic dataset examples, making downstream manual or automatic interpretation much easier. We apply our method to GPT2-small, and provide a UI for exploring the results. Neuron embeddings are computed using a model's internal representations and weights, making them domain and architecture agnostic and removing the risk of introducing external structure which may not reflect a model's actual computation. We describe how neuron embeddings can be used to measure neuron polysemanticity, which could be applied to better evaluate the efficacy of Sparse Auto-Encoders (SAEs).

View on arXiv PDF

Similar