LG · Sep 2, 2023

Emergent Linear Representations in World Models of Self-Supervised Sequence Models

arXiv:2309.00941v2 · 334 citations
Originality: Incremental advance
AI Analysis

This paper offers incremental progress in interpretability for AI and machine learning researchers, specifically in understanding the internal representations of self-supervised models.

The paper tackled the problem of interpreting decision-making processes in sequence models by discovering a linear representation of board states in an Othello-playing neural network, enabling model control through vector arithmetic.

How do sequence models represent their decision-making process? Prior work suggests that an Othello-playing neural network learned nonlinear models of the board state (Li et al., 2023). In this work, we provide evidence of a closely related linear representation of the board. In particular, we show that probing for "my colour" vs. "opponent's colour" may be a simple yet powerful way to interpret the model's internal state. This precise understanding of the internal representations allows us to control the model's behaviour with simple vector arithmetic. Linear representations enable significant interpretability progress, which we demonstrate with further exploration of how the world model is computed.
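The two ideas in the abstract, a linear probe for "my colour" vs. "opponent's colour" and control by vector arithmetic, can be illustrated with a toy sketch. It is not the paper's code: the activations are synthetic, the width `DIM`, the hidden direction `true_dir`, and the noise level are made up, and the probe is a simple difference of class means rather than a trained probe. It only shows why a linear representation makes both reading and editing a state straightforward.

```python
import random

random.seed(0)
DIM = 16  # hypothetical activation width, chosen only for illustration


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def add_scaled(a, b, s):
    """Return a + s * b, element-wise."""
    return [x + s * y for x, y in zip(a, b)]


def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


# Assumption: one board square is encoded along a single unit direction.
true_dir = [random.gauss(0, 1) for _ in range(DIM)]
norm = dot(true_dir, true_dir) ** 0.5
true_dir = [x / norm for x in true_dir]


def make_activation(label):
    """Synthetic activation; label +1 = "mine", -1 = "opponent's"."""
    noise = [random.gauss(0, 0.3) for _ in range(DIM)]
    return add_scaled(noise, true_dir, 2.0 * label)


data = [(make_activation(lbl), lbl) for lbl in [1, -1] * 200]

# Difference-of-means probe: a stand-in for a trained linear probe.
mu_mine = mean([x for x, y in data if y == 1])
mu_opp = mean([x for x, y in data if y == -1])
probe = add_scaled(mu_mine, mu_opp, -1.0)
midpoint = [(a + b) / 2 for a, b in zip(mu_mine, mu_opp)]
bias = -dot(probe, midpoint)


def predict(x):
    return 1 if dot(probe, x) + bias > 0 else -1


def flip(x):
    """Edit the activation by vector arithmetic so the probe reads the
    opposite colour: reflect x across the probe's decision boundary."""
    score = dot(probe, x) + bias
    return add_scaled(x, probe, -2.0 * score / dot(probe, probe))


accuracy = sum(predict(x) == y for x, y in data) / len(data)
flipped_ok = all(predict(flip(x)) == -y for x, y in data)
print(f"probe accuracy: {accuracy:.3f}, all flips succeed: {flipped_ok}")
```

In this sketch `flip` changes the activation only along the probe direction, so a single vector addition is enough to change what the probe reads. Whether a real model's downstream behaviour follows such an edit is an empirical question that the paper addresses; this toy example does not.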

Code Implementations: 2 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
