LG, AI | Nov 16, 2022

Engineering Monosemanticity in Toy Models

Adam S. Jermyn, Nicholas Schiefer, Evan Hubinger

arXiv:2211.09169v1

Originality: Incremental advance

AI Analysis

This work addresses neural network interpretability. It is an incremental advance, since it reports preliminary attempts in toy models.

The researchers tackle the problem of engineering monosemantic neurons in toy models to aid interpretability. They find that models can be made more monosemantic without increasing the loss by steering training toward loss minima with moderate negative biases, and that adding neurons per layer increases monosemanticity at higher computational cost.

In some neural networks, individual neurons correspond to natural "features" in the input. Such monosemantic neurons are of great help in interpretability studies, as they can be cleanly understood. In this work we report preliminary attempts to engineer monosemanticity in toy models. We find that models can be made more monosemantic without increasing the loss by just changing which local minimum the training process finds. More monosemantic loss minima have moderate negative biases, and we are able to use this fact to engineer highly monosemantic models. We are able to mechanistically interpret these models, including the residual polysemantic neurons, and uncover a simple yet surprising algorithm. Finally, we find that providing models with more neurons per layer makes the models more monosemantic, albeit at increased computational cost. These findings point to a number of new questions and avenues for engineering monosemanticity, which we intend to study in future work.
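As a rough illustration of why negative biases and monosemanticity might go together, the sketch below counts how many input features make each neuron fire, with and without a negative bias. It is not the paper's code or training setup: the ReLU activation, the hand-picked weight matrix `W`, the bias value, and the helper `features_per_neuron` are all assumptions made for this example.

```python
import numpy as np

def features_per_neuron(W, b):
    """Count, for each neuron, how many one-hot input features make it fire.

    W has shape (neurons, features); b is a scalar bias shared by all neurons.
    A count of 1 means the neuron responds to a single feature (monosemantic
    in this toy sense); a larger count means it responds to several.
    """
    n_features = W.shape[1]
    X = np.eye(n_features)             # one column per feature, each one-hot
    acts = np.maximum(W @ X + b, 0.0)  # assumed ReLU layer, shape (neurons, features)
    return (acts > 0).sum(axis=1)

# Hypothetical weights: each neuron reads one feature strongly (1.0)
# and every other feature weakly (0.2).
W = np.full((3, 3), 0.2) + 0.8 * np.eye(3)

no_bias = features_per_neuron(W, b=0.0)    # weak weights still pass the ReLU
neg_bias = features_per_neuron(W, b=-0.3)  # weak weights fall below zero

print(no_bias, neg_bias)
```

With zero bias every neuron fires for all three features, while the moderate negative bias leaves each neuron firing for only its strongly weighted feature. In this toy setting the bias acts as a threshold that filters out weak cross-feature responses.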
