LG · AI · CL · Aug 24, 2025

Activation Transport Operators

arXiv:2508.17540v2
Originality: Incremental advance
AI Analysis

This work addresses a gap in mechanistic interpretability for LLMs, offering tools for safety and debugging, but it is incremental as it builds on existing sparse-dictionary learning and activation patching methods.

The paper tackles the problem of understanding how features flow through the residual stream in transformer decoders. It proposes Activation Transport Operators (ATO) to distinguish linearly transported features from synthesized ones, and reports empirical results on transport efficiency and on the estimated size of the linear-transport subspace.

The residual stream mediates communication between transformer decoder layers via linear reads and writes of non-linear computations. While sparse-dictionary learning-based methods locate features in the residual stream, and activation patching methods discover circuits within the model, the mechanism by which features flow through the residual stream remains understudied. Understanding this dynamic can better inform jailbreaking protections and enable early detection and correction of model mistakes. In this work, we propose Activation Transport Operators (ATO), linear maps from upstream to downstream residuals $k$ layers later, evaluated in feature space using downstream SAE decoder projections. We empirically demonstrate that these operators can determine whether a feature has been linearly transported from a previous layer or synthesised by non-linear layer computation. We develop the notion of transport efficiency, for which we provide an upper bound, and use it to estimate the size of the residual stream subspace that corresponds to linear transport. We report the measured transport efficiency and the size of that subspace. This compute-light method (no finetuning, <50 GPU-h) offers practical tools for safety and debugging, and a clearer picture of where computation in LLMs behaves linearly.
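To make the idea of a linear map evaluated along a downstream feature direction concrete, here is a minimal sketch on synthetic data. It is not the authors' implementation: the abstract does not specify the estimator, the regularisation, or the exact metric, so the ridge fit, the R² score, and the names `fit_transport_operator`, `feature_r2`, and `ridge` are all assumptions chosen for illustration. In particular, R² here is a stand-in score and not the paper's transport efficiency. The random vectors stand in for residual activations, and the unit vectors stand in for SAE decoder directions.

```python
import numpy as np


def fit_transport_operator(X_up, Y_down, ridge=1e-3):
    """Ridge least-squares fit of T so that X_up @ T approximates Y_down.

    Hypothetical helper: the regulariser is an assumption, not taken from the paper.
    """
    d = X_up.shape[1]
    gram = X_up.T @ X_up + ridge * np.eye(d)
    return np.linalg.solve(gram, X_up.T @ Y_down)


def feature_r2(X_up, Y_down, T, decoder_dir):
    """R^2 between actual and predicted projections onto one feature direction."""
    u = decoder_dir / np.linalg.norm(decoder_dir)
    actual = Y_down @ u
    predicted = (X_up @ T) @ u
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-ins for upstream and downstream residual activations.
rng = np.random.default_rng(0)
n, d = 2000, 16
X_up = rng.normal(size=(n, d))
A = rng.normal(size=(d, d)) / np.sqrt(d)
Y_down = X_up @ A  # every direction is a purely linear function of X_up

# Make direction 1 mostly "synthesised": add a zero-mean non-linear term
# that no linear map of X_up can reproduce.
transported_dir = np.eye(d)[0]
synth_dir = np.eye(d)[1]
Y_down = Y_down + 3.0 * np.outer(X_up[:, 0] ** 2 - 1.0, synth_dir)

T = fit_transport_operator(X_up, Y_down)
r2_transported = feature_r2(X_up, Y_down, T, transported_dir)
r2_synth = feature_r2(X_up, Y_down, T, synth_dir)
print(f"transported feature R^2: {r2_transported:.3f}")
print(f"synthesised feature R^2: {r2_synth:.3f}")
```

In this toy setup the direction that is a linear function of the upstream activations scores close to 1, while the direction dominated by a non-linear term scores much lower. That mirrors the transported-versus-synthesised distinction the abstract describes, under the assumptions stated above.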

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.