LG | Oct 11, 2024

Evolution of SAE Features Across Layers in LLMs

arXiv:2410.08869v2 | 9 citations | h-index: 3
AI Analysis

This work offers incremental insight into how features evolve across layers in LLMs, of interest mainly to interpretability researchers.

The study analyzes how Sparse Autoencoder (SAE) features evolve across layers in transformer-based language models. It finds that many features are passed through from the previous layer, some can be expressed as quasi-boolean combinations of previous-layer features, and others become more specialized in later layers.

Sparse Autoencoders for transformer-based language models are typically defined independently per layer. In this work, we analyze statistical relationships between features in adjacent layers to understand how features evolve through a forward pass. We provide a graph visualization interface for features and their most similar next-layer neighbors (https://stefanhex.com/spar-2024/feature-browser/), and build communities of related features across layers. We find that a considerable number of features are passed through from a previous layer, some features can be expressed as quasi-boolean combinations of previous features, and some features become more specialized in later layers.
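
As a rough illustration of what "statistical relationships between features in adjacent layers" could look like in practice, the sketch below computes a Pearson correlation matrix between SAE feature activations at two adjacent layers and picks each feature's most similar next-layer neighbors. The abstract does not say which similarity measure the authors use, so Pearson correlation is an assumption here. The function names (`pearson_cross_correlation`, `top_next_layer_neighbors`) and the synthetic activations are hypothetical and not taken from the paper.

```python
import numpy as np


def pearson_cross_correlation(acts_prev, acts_next, eps=1e-8):
    """Pearson correlation between every pair of features in two layers.

    acts_prev: (n_tokens, n_prev) SAE feature activations at one layer.
    acts_next: (n_tokens, n_next) SAE feature activations at the next layer.
    Returns an (n_prev, n_next) correlation matrix.
    """
    a = acts_prev - acts_prev.mean(axis=0, keepdims=True)
    b = acts_next - acts_next.mean(axis=0, keepdims=True)
    a_norm = np.linalg.norm(a, axis=0, keepdims=True)
    b_norm = np.linalg.norm(b, axis=0, keepdims=True)
    # eps guards against division by zero for dead (constant) features.
    return (a / (a_norm + eps)).T @ (b / (b_norm + eps))


def top_next_layer_neighbors(corr, k=1):
    """Indices of the k most correlated next-layer features per row."""
    return np.argsort(-corr, axis=1)[:, :k]


# Synthetic toy data (an assumption for illustration, not real SAE output).
rng = np.random.default_rng(0)
n_tokens = 1000
# Sparse, non-negative activations: each feature fires on ~20% of tokens.
prev = rng.random((n_tokens, 3)) * (rng.random((n_tokens, 3)) < 0.2)

next_ = np.zeros((n_tokens, 3))
# Next-layer feature 0 is a pure pass-through of previous feature 1.
next_[:, 0] = prev[:, 1]
# Next-layer feature 1 fires only when previous features 0 AND 2 both fire.
next_[:, 1] = ((prev[:, 0] > 0) & (prev[:, 2] > 0)).astype(float)
# Next-layer feature 2 is unrelated noise.
next_[:, 2] = rng.random(n_tokens)

corr = pearson_cross_correlation(prev, next_)
neighbors = top_next_layer_neighbors(corr, k=1)
```

In this toy setup, the passed-through feature shows a correlation close to 1 with its copy, while the AND-like feature correlates only partially with each of its two inputs. This is one simple way the qualitative categories described in the abstract might show up in a similarity matrix.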

