LG, NE · May 5, 2023

A technical note on bilinear layers for interpretability

arXiv:2305.03452v1 · 11 citations
Originality: Incremental advance
AI Analysis

This work is aimed at neural network interpretability researchers. It is an incremental advance: bilinear layers admit a formal circuit analysis, which may in turn support deeper safety insights.

The paper tackles the challenge of interpreting neural networks by proposing bilinear layers as a more mathematically analyzable alternative to standard MLPs. It shows that they can be expressed using only linear operations and third-order tensors, and that this expression fits into the mathematical framework for transformer circuits.

The ability of neural networks to represent more features than neurons makes interpreting them challenging. This phenomenon, known as superposition, has spurred efforts to find architectures that are more interpretable than standard multilayer perceptrons (MLPs) with elementwise activation functions. In this note, I examine bilinear layers, a type of MLP layer that is mathematically much easier to analyze while simultaneously performing better than standard MLPs. Although they are nonlinear functions of their input, I demonstrate that bilinear layers can be expressed using only linear operations and third-order tensors. We can integrate this expression for bilinear layers into a mathematical framework for transformer circuits, which was previously limited to attention-only transformers. These results suggest that bilinear layers are easier to analyze mathematically than current architectures and thus may lend themselves to deeper safety insights by allowing us to talk more formally about circuits in neural networks. Additionally, bilinear layers may offer an alternative path for mechanistic interpretability through understanding the mechanisms of feature construction instead of enumerating a (potentially exponentially) large number of features in large models.
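The claim that a bilinear layer reduces to linear operations plus a third-order tensor can be checked numerically. The sketch below is illustrative and is not taken from the paper. It assumes the common bias-free form of a bilinear layer, the elementwise product (W1 x) * (W2 x), and builds a tensor B with entries B[a][i][j] = W1[a][i] * W2[a][j]. The function names, dimensions, and random weights are all hypothetical choices made for this example.

```python
import random

def matvec(W, x):
    """Plain matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def bilinear_layer(W1, W2, x):
    """Assumed bilinear form: elementwise product of two linear maps."""
    return [a * b for a, b in zip(matvec(W1, x), matvec(W2, x))]

def bilinear_tensor(W1, W2):
    """Third-order tensor with B[a][i][j] = W1[a][i] * W2[a][j]."""
    return [[[w1 * w2 for w2 in row2] for w1 in row1]
            for row1, row2 in zip(W1, W2)]

def apply_tensor(B, x):
    """Contract B with x on both input indices:
    out[a] = sum over i, j of B[a][i][j] * x[i] * x[j]."""
    n = len(x)
    return [sum(B_a[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
            for B_a in B]

# Hypothetical small dimensions and random weights, for illustration only.
random.seed(0)
d_in, d_out = 4, 3
W1 = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
W2 = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
x = [random.gauss(0, 1) for _ in range(d_in)]

direct = bilinear_layer(W1, W2, x)
via_tensor = apply_tensor(bilinear_tensor(W1, W2), x)

# The two computations should agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(direct, via_tensor))
print(max_diff)
```

Under these assumptions the tensor B depends only on the weights, so the layer's nonlinearity is confined to the input appearing twice in the contraction. That is the property that lets the layer be treated with the same linear-algebraic tools as the rest of a circuit.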

