LG · AI · ML · Jan 12, 2023

Tracr: Compiled Transformers as a Laboratory for Interpretability

DeepMind
arXiv:2301.05062v5 · 98 citations · h-index: 16 · Has Code
Originality: Incremental advance
AI Analysis

Tracr gives AI interpretability researchers a way to test and validate their methods on models whose internal programs are known, addressing a key bottleneck in the field.

The authors address the problem of evaluating interpretability methods for transformer models. They developed Tracr, a compiler that translates human-readable programs into standard decoder-only transformers with known structure. That known structure enables controlled experiments, such as studying superposition, and provides ground truth against which interpretability methods can be evaluated.

We show how to "compile" human-readable programs into standard decoder-only transformer models. Our compiler, Tracr, generates models with known structure. This structure can be used to design experiments. For example, we use it to study "superposition" in transformers that execute multi-step algorithms. Additionally, the known structure of Tracr-compiled models can serve as ground-truth for evaluating interpretability methods. Commonly, because the "programs" learned by transformers are unknown, it is unclear whether an interpretation succeeded. We demonstrate our approach by implementing and examining programs including computing token frequencies, sorting, and parenthesis checking. We provide an open-source implementation of Tracr at https://github.com/google-deepmind/tracr.
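To make the idea of a "human-readable program" concrete, the sketch below emulates a few select/aggregate-style primitives in plain Python and uses them to compute token frequencies, one of the example programs named in the abstract. This is not Tracr's API and it compiles nothing: the function names (`select`, `selector_width`, `aggregate`, `token_frequencies`, `frac_prevs`) are illustrative assumptions, chosen only to show the kind of program such a compiler might take as input.

```python
from typing import Callable, List, Sequence

# selector[q][k] is True when query position q selects key position k.
# This plays the role of a hard (0/1) attention pattern. Hypothetical type alias.
Selector = List[List[bool]]


def select(keys: Sequence, queries: Sequence,
           predicate: Callable[[object, object], bool]) -> Selector:
    """Compare every query with every key to build a boolean attention pattern."""
    return [[predicate(k, q) for k in keys] for q in queries]


def selector_width(selector: Selector) -> List[int]:
    """For each query position, count how many key positions it selects."""
    return [sum(row) for row in selector]


def aggregate(selector: Selector, values: Sequence[float]) -> List[float]:
    """For each query position, average the selected values (0.0 if none)."""
    out = []
    for row in selector:
        picked = [v for v, on in zip(values, row) if on]
        out.append(sum(picked) / len(picked) if picked else 0.0)
    return out


def token_frequencies(tokens: Sequence[str]) -> List[int]:
    """At each position, count how often that position's token occurs in the input."""
    same_token = select(tokens, tokens, lambda k, q: k == q)
    return selector_width(same_token)


def frac_prevs(tokens: Sequence[str], target: str) -> List[float]:
    """At each position, the fraction of tokens so far (inclusive) equal to target."""
    positions = list(range(len(tokens)))
    # Causal pattern: each query position selects itself and all earlier positions.
    prefix = select(positions, positions, lambda k, q: k <= q)
    indicator = [1.0 if t == target else 0.0 for t in tokens]
    return aggregate(prefix, indicator)


if __name__ == "__main__":
    print(token_frequencies(list("hello")))
    print(frac_prevs(list("xax"), "x"))
```

Because every intermediate value here (the selector matrix, the per-position counts) is defined by the program itself, a model compiled from such a program would have a known intended computation at each step, which is what makes the ground-truth comparison described above possible.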

Code Implementations: 1 repo
