LG · AI · May 24

Continuous-Depth Field Theory for Transformer Patching and Mechanistic Interpretability

arXiv:2605.252254.6
AI Analysis

For mechanistic interpretability researchers, this provides a unified mathematical language for organizing patching experiments, though the empirical validation is limited to GPT-2-style models and the framework is largely a reformulation of existing techniques.

The paper develops a field-theoretic framework for activation patching in Transformers, treating the residual stream as a depth-token field and formulating patching as localized source insertion. Empirically, it identifies a bounded local linear regime, predicts patch effects from first-order sensitivities, and shows that prompt-induced residual displacements can transfer answer behavior.

Mechanistic interpretability often uses activation patching, causal tracing, path patching, and steering directions to reveal behaviorally meaningful directions in Transformer activation space. This paper develops a field-theoretic framework for organizing and predicting such interventions. Treating the residual stream as a depth-token field, we formulate patching as localized source insertion, patch effects as sensitivity-field predictions, downstream propagation as empirical Green-function response, and patch selection as an adjoint variational problem. Empirically, we test the forward response theory in GPT-2-style autoregressive Transformers by applying localized residual-field interventions and observing the induced residual-field differences and logit-difference responses. We identify a bounded local linear regime; predict patch effects from first-order sensitivities across residual sites; measure structured anisotropic propagation across depth and token position; construct response descriptions from high-sensitivity sites and sliced Green operators; and show that prompt-induced residual displacements can transfer answer behavior. These results establish response objects, namely sensitivities, propagated fields, and Green-operator slices, as a practical language for organizing patching experiments and as the forward mathematical basis for formulating patch-site inference and cross-scale transfer.
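The first-order prediction described in the abstract can be illustrated with a minimal sketch. This is not the paper's code or model: it assumes a toy residual stack h_{l+1} = h_l + tanh(W_l h_l) with a single token position, a linear readout standing in for the logit difference, and hypothetical helper names (`forward`, `sensitivities`, `patch_effect`). The sensitivity at a site is the gradient of the readout with respect to the residual state there, and the predicted patch effect is its inner product with the inserted displacement.

```python
import numpy as np

def forward(h0, weights, patch_layer=None, delta=None):
    """Run a toy residual stack; optionally add `delta` to the input of `patch_layer`."""
    h = h0.copy()
    states = []
    for l, W in enumerate(weights):
        if l == patch_layer:
            h = h + delta          # localized source insertion at this depth
        states.append(h)           # residual state entering layer l
        h = h + np.tanh(W @ h)     # toy residual update (assumption, not GPT-2)
    return h, states

def sensitivities(states, weights, readout):
    """Backpropagate d(readout . h_final) / d h_l for every layer input h_l."""
    g = readout.copy()
    grads = [None] * len(weights)
    for l in reversed(range(len(weights))):
        W, h = weights[l], states[l]
        # Transposed Jacobian of h + tanh(W h), applied to the downstream gradient.
        g = g + W.T @ (g * (1.0 - np.tanh(W @ h) ** 2))
        grads[l] = g
    return grads

rng = np.random.default_rng(0)
d, n_layers = 8, 4
weights = [0.3 * rng.standard_normal((d, d)) for _ in range(n_layers)]
h0 = rng.standard_normal(d)
readout = rng.standard_normal(d)   # stand-in for a logit-difference direction

h_clean, clean_states = forward(h0, weights)
grads = sensitivities(clean_states, weights, readout)

direction = rng.standard_normal(d)
direction /= np.linalg.norm(direction)

def patch_effect(eps, layer=1):
    """Return (measured, first-order predicted) readout change for a patch of norm eps."""
    delta = eps * direction
    h_patched, _ = forward(h0, weights, patch_layer=layer, delta=delta)
    measured = readout @ (h_patched - h_clean)
    predicted = grads[layer] @ delta
    return measured, predicted

for eps in (1e-4, 1e-2, 1.0):
    measured, predicted = patch_effect(eps)
    print(f"eps={eps:g}  measured={measured:+.6f}  predicted={predicted:+.6f}")
```

In this toy setting the two numbers agree closely for small patches and drift apart as the patch norm grows, which is the kind of behavior a bounded local linear regime would describe. Where that boundary sits in a real Transformer is an empirical question that the sketch does not address.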

