LG, AI, NE | Aug 20, 2024

Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations

arXiv:2408.10920v1 | 45 citations | h-index: 23
Originality: Incremental advance
AI Analysis

This work addresses a foundational problem in AI interpretability by revealing non-linear representations in RNNs. The advance is incremental, but it clarifies the limitations of existing theories.

The paper presents a counterexample to the strong Linear Representation Hypothesis by showing that small gated RNNs learn to represent tokens using magnitude-based encodings rather than linear directions when trained to repeat sequences, challenging assumptions in interpretability research.

The Linear Representation Hypothesis (LRH) states that neural networks learn to encode concepts as directions in activation space, and a strong version of the LRH states that models learn only such encodings. In this paper, we present a counterexample to this strong LRH: when trained to repeat an input token sequence, gated recurrent neural networks (RNNs) learn to represent the token at each position with a particular order of magnitude, rather than a direction. These representations have layered features that are impossible to locate in distinct linear subspaces. To show this, we train interventions to predict and manipulate tokens by learning the scaling factor corresponding to each sequence position. These interventions indicate that the smallest RNNs find only this magnitude-based solution, while larger RNNs have linear representations. These findings strongly indicate that interpretability research should not be confined by the LRH.
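To make the idea of a magnitude-based encoding concrete, the following is a minimal toy sketch in plain Python. It is not the paper's trained RNN or its learned interventions: the scale factor `GAMMA`, the vocabulary size `VOCAB`, and the functions `encode` and `decode` are all hypothetical names and values chosen for illustration. It only shows how every position of a sequence can share the same coordinates of a single vector, with positions distinguished by order of magnitude rather than by separate linear subspaces.

```python
# Toy illustration only: NOT the paper's model or its learned interventions.
GAMMA = 0.1  # assumed scale factor separating successive positions
VOCAB = 8    # assumed vocabulary size


def encode(tokens, gamma=GAMMA, vocab=VOCAB):
    """Store a whole sequence in one vector.

    Position t adds gamma**t to the coordinate of its token, so all
    positions share the same coordinates and differ only in magnitude.
    """
    h = [0.0] * vocab
    for t, tok in enumerate(tokens):
        h[tok] += gamma ** t
    return h


def decode(h, length, gamma=GAMMA):
    """Read the tokens back layer by layer, largest magnitude first."""
    h = list(h)  # work on a copy
    out = []
    for t in range(length):
        # The token at position t dominates everything stored after it,
        # because the later layers sum to less than gamma**t.
        tok = max(range(len(h)), key=lambda v: h[v])
        out.append(tok)
        h[tok] -= gamma ** t  # peel off this layer
    return out


seq = [3, 1, 4, 1, 5]
assert decode(encode(seq), len(seq)) == seq
```

In this sketch the token `1` appears at two positions and both occurrences land in the same coordinate, so no linear projection isolates "the token at position 1" from "the token at position 3". Recovering them requires peeling off magnitudes in order, which is the flavor of layered encoding the abstract describes.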

Code Implementations: 1 repo
