LG · AI · CL · Apr 9, 2024

Does Transformer Interpretability Transfer to RNNs?

arXiv:2404.05971v1 · 10 citations · h-index: 9
Originality: Synthesis-oriented
AI Analysis

This paper addresses model interpretability for researchers and practitioners working with emerging RNN architectures. Its contribution is incremental: it adapts existing methods rather than introducing new ones.

The paper investigates whether interpretability methods designed for transformers, such as contrastive activation addition and the tuned lens, transfer to modern RNNs like Mamba and RWKV. It finds that most of the techniques work and that some can be improved by leveraging the RNNs' compressed state.

Recent advances in recurrent neural network architectures, such as Mamba and RWKV, have enabled RNNs to match or exceed the performance of equal-size transformers in terms of language modeling perplexity and downstream evaluations, suggesting that future systems may be built on completely new architectures. In this paper, we examine if selected interpretability methods originally designed for transformer language models will transfer to these up-and-coming recurrent architectures. Specifically, we focus on steering model outputs via contrastive activation addition, on eliciting latent predictions via the tuned lens, and eliciting latent knowledge from models fine-tuned to produce false outputs under certain conditions. Our results show that most of these techniques are effective when applied to RNNs, and we show that it is possible to improve some of them by taking advantage of RNNs' compressed state.
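As a rough illustration of the contrastive activation addition idea named in the abstract, the sketch below builds a steering vector as the difference between mean activations on two contrasting prompt sets and adds it to a hidden state. It is a toy example, not the paper's implementation: the function names, the `alpha` scale, and the activation values are all made up for illustration, and which layer or recurrent state the vector is added to in an RNN is left open here.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(positive: List[Vector], negative: List[Vector]) -> Vector:
    """Mean activation on 'positive' prompts minus mean on 'negative' prompts."""
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    return [p - q for p, q in zip(pos_mean, neg_mean)]


def apply_steering(hidden: Vector, steer: Vector, alpha: float = 1.0) -> Vector:
    """Add the scaled steering vector to a hidden state (hypothetical helper)."""
    return [h + alpha * s for h, s in zip(hidden, steer)]


# Toy activations from a single layer; the values are invented for illustration.
positive_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 2.0]]
negative_acts = [[0.0, 1.0, 1.0], [2.0, 1.0, 1.0]]

steer = steering_vector(positive_acts, negative_acts)
steered = apply_steering([0.5, 0.5, 0.5], steer, alpha=2.0)
print(steer, steered)
```

In a real model the activations would come from forward passes over contrasting prompts, and `alpha` would be tuned by hand; both are assumptions of this sketch.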

Code Implementations: 1 repo
