LG | CL | Feb 5, 2025

Analyze Feature Flow to Enhance Interpretation and Steering in Language Models

arXiv:2502.03032v3 | 7 citations | h-index: 7 | ICML
Originality: Incremental advance
AI Analysis

The paper provides a causal interpretability framework for researchers and practitioners working with large language models, though it builds incrementally on prior work on inter-layer feature links.

The paper tackles the problem of interpreting and controlling large language models by mapping features across layers to trace how they evolve, and demonstrates that this mapping enables targeted steering of text generation.

We introduce a new approach to systematically map features discovered by sparse autoencoders across consecutive layers of large language models, extending earlier work that examined inter-layer feature links. By using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage. This method yields granular flow graphs of feature evolution, enabling fine-grained interpretability and mechanistic insights into model computations. Crucially, we demonstrate how these cross-layer feature maps facilitate direct steering of model behavior by amplifying or suppressing chosen features, achieving targeted thematic control in text generation. Together, our findings highlight the utility of a causal, cross-layer interpretability framework that not only clarifies how features develop through forward passes but also provides new means for transparent manipulation of large language models.
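The data-free matching and steering ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each sparse autoencoder exposes a decoder matrix whose rows are feature directions in the model's hidden space, that "data-free" means comparing those weight vectors directly, and that steering adds a scaled feature direction to a hidden state. The names `unit_rows`, `match_features`, and `steer`, and the 0.5 threshold, are hypothetical choices for illustration.

```python
import numpy as np


def unit_rows(w):
    """Scale each row (one feature direction) to unit L2 norm."""
    norms = np.linalg.norm(w, axis=1, keepdims=True)
    return w / np.clip(norms, 1e-12, None)


def match_features(dec_a, dec_b, threshold=0.5):
    """Match each layer-B feature to its most similar layer-A feature.

    dec_a: (n_a, d) decoder rows for the earlier layer (assumed layout).
    dec_b: (n_b, d) decoder rows for the later layer.
    Returns (best_idx, best_sim, is_new); is_new flags layer-B features whose
    best cosine similarity falls below the (illustrative) threshold.
    """
    sims = unit_rows(dec_b) @ unit_rows(dec_a).T  # (n_b, n_a) cosine similarities
    best_idx = sims.argmax(axis=1)
    best_sim = sims.max(axis=1)
    is_new = best_sim < threshold
    return best_idx, best_sim, is_new


def steer(hidden, direction, scale):
    """Add a scaled unit feature direction to a hidden state.

    A positive scale amplifies the feature and a negative scale suppresses it.
    Additive steering is an assumption of this sketch.
    """
    unit = direction / np.linalg.norm(direction)
    return hidden + scale * unit


# Toy decoders in an 8-dimensional hidden space.
basis = np.eye(8)
dec_a = basis[:4]                      # layer A: features along e0..e3
dec_b = np.stack([
    3.0 * basis[2],                    # rescaled copy of A's feature 2 (persists)
    basis[0] + 0.1 * basis[5],         # slightly rotated version of A's feature 0
    basis[7],                          # orthogonal to every A feature (new)
])

best_idx, best_sim, is_new = match_features(dec_a, dec_b)
steered = steer(np.zeros(8), dec_b[0], scale=2.0)
```

In this toy setup the first two layer-B features are linked to layer-A features 2 and 0, and the third is flagged as newly appearing. Repeating the matching over every pair of consecutive layers would give the edges of a flow graph of the kind the abstract describes.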

