LG · AI · Nov 16, 2022

Engineering Monosemanticity in Toy Models

arXiv:2211.09169v1 · 16 citations · h-index: 21
Originality: Incremental advance
AI Analysis

This work addresses interpretability in neural networks, but its contribution is incremental: it reports preliminary attempts in toy models.

The researchers set out to engineer monosemantic neurons in toy models as an aid to interpretability. They find that models can be made more monosemantic without increasing the loss by steering training toward loss minima with moderate negative biases, and that adding neurons per layer increases monosemanticity at higher computational cost.

In some neural networks, individual neurons correspond to natural "features" in the input. Such monosemantic neurons are of great help in interpretability studies, as they can be cleanly understood. In this work we report preliminary attempts to engineer monosemanticity in toy models. We find that models can be made more monosemantic without increasing the loss by just changing which local minimum the training process finds. More monosemantic loss minima have moderate negative biases, and we are able to use this fact to engineer highly monosemantic models. We are able to mechanistically interpret these models, including the residual polysemantic neurons, and uncover a simple yet surprising algorithm. Finally, we find that providing models with more neurons per layer makes the models more monosemantic, albeit at increased computational cost. These findings point to a number of new questions and avenues for engineering monosemanticity, which we intend to study in future work.
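
To make the link between negative biases and monosemanticity concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's method. The score used (a neuron's largest per-feature activation divided by the sum of its per-feature activations), the helper names, and the weight values are all hypothetical choices made for this example. Each feature is presented alone at unit strength to a single ReLU layer.

```python
import numpy as np

def feature_activations(W, b):
    """ReLU activation of every neuron when each feature is shown alone.

    W has shape (n_neurons, n_features); b has shape (n_neurons,).
    Entry [i, j] of the result is relu(W[i, j] + b[i]).
    """
    return np.maximum(W + b[:, None], 0.0)

def monosemanticity_scores(W, b):
    """Hypothetical score: share of a neuron's total response due to its top feature.

    A score of 1.0 means the neuron fires for exactly one feature.
    Neurons that never fire are assigned a score of 0.
    """
    acts = feature_activations(W, b)
    total = acts.sum(axis=1)
    top = acts.max(axis=1)
    return np.divide(top, total, out=np.zeros_like(top), where=total > 0)

# Made-up weights: each neuron has one strong feature and two weak ones.
W = np.array([[1.0, 0.3, 0.2],
              [0.1, 0.9, 0.4]])

scores_zero_bias = monosemanticity_scores(W, np.zeros(2))
scores_neg_bias = monosemanticity_scores(W, np.full(2, -0.4))

print("zero bias:    ", scores_zero_bias)
print("negative bias:", scores_neg_bias)
```

In this toy setting the negative bias pushes the weak feature responses below the ReLU threshold, so each neuron ends up responding to a single feature. How the paper itself measures monosemanticity and chooses biases may differ from this sketch.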

Code Implementations: 1 repo
