CL
Oct 21, 2025

DePass: Unified Feature Attributing by Simple Decomposed Forward Pass

Tsinghua
arXiv:2510.18462v2
2 citations
h-index: 35
Originality: Incremental advance
AI Analysis

This paper addresses mechanistic interpretability for Transformer models. It is rated an incremental advance: a new method aimed at an existing bottleneck.

The researchers tackled the challenge of attributing Transformer model behavior to internal computations by introducing DePass, a unified framework that decomposes hidden states into additive components and propagates them with fixed attention scores and MLP activations. The method achieves faithful, fine-grained attribution without auxiliary training, as validated across token-level, model component-level, and subspace-level tasks.

Attributing the behavior of Transformer models to internal computations is a central challenge in mechanistic interpretability. We introduce DePass, a unified framework for feature attribution based on a single decomposed forward pass. DePass decomposes hidden states into customized additive components, then propagates them with the attention scores and MLP activations held fixed. It achieves faithful, fine-grained attribution without requiring auxiliary training. We validate DePass across token-level, model component-level, and subspace-level attribution tasks, demonstrating its effectiveness and fidelity. Our experiments highlight its potential to attribute information flow between arbitrary components of a Transformer model. We hope DePass serves as a foundational tool for broader applications in interpretability.
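To make the idea of a decomposed forward pass concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation. It assumes a single attention mix with given (frozen) attention weights and a bias-free ReLU MLP whose on/off gates are frozen from the full forward pass, with no residual connections or layer normalization. Under those assumptions both steps are linear, so per-token components propagated separately sum back to the full output. All weights, shapes, and function names are hypothetical, and the frozen-gate treatment of the MLP is a simplification chosen for illustration rather than the paper's exact rule.

```python
# Toy sketch of a decomposed forward pass (hypothetical weights and names).
# Assumptions: frozen attention weights, bias-free ReLU MLP with frozen gates,
# no residual stream, no layer norm.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attention_mix(attn, values):
    """Mix per-position vectors with fixed attention weights attn[i][j]."""
    dim = len(values[0])
    mixed = []
    for weights in attn:
        acc = [0.0] * dim
        for w, v in zip(weights, values):
            acc = [a + w * vi for a, vi in zip(acc, v)]
        mixed.append(acc)
    return mixed

def relu_gates(W_in, x):
    """Record which hidden neurons are active (1.0) for input x."""
    return [1.0 if z > 0 else 0.0 for z in matvec(W_in, x)]

def gated_mlp(W_in, W_out, gates, x):
    """Apply the MLP with the ReLU replaced by fixed 0/1 gates (linear in x)."""
    hidden = [g * z for g, z in zip(gates, matvec(W_in, x))]
    return matvec(W_out, hidden)

# Two token positions, hidden size 2, MLP width 3.
x = [[1.0, 2.0], [-1.0, 0.5]]
attn = [[1.0, 0.0], [0.25, 0.75]]           # causal weights, treated as fixed
W_in = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]
W_out = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]

# Full forward pass; the gates are computed here and then frozen.
mixed = attention_mix(attn, x)
gates = [relu_gates(W_in, m) for m in mixed]
full_out = [gated_mlp(W_in, W_out, g, m) for g, m in zip(gates, mixed)]

# Decompose the input into one additive component per source token.
components = [
    [x[j] if j == k else [0.0, 0.0] for j in range(len(x))]
    for k in range(len(x))
]

# Propagate each component through the same frozen attention and gates.
component_out = []
for comp in components:
    comp_mixed = attention_mix(attn, comp)
    component_out.append(
        [gated_mlp(W_in, W_out, g, m) for g, m in zip(gates, comp_mixed)]
    )

# Summing the propagated components recovers the full output.
recombined = [
    [sum(component_out[k][i][d] for k in range(len(components)))
     for d in range(len(full_out[i]))]
    for i in range(len(full_out))
]

for k, out in enumerate(component_out):
    print(f"contribution of token {k} to each position:", out)
print("full output:", full_out)
```

In this sketch, `component_out[k][i]` is the share of position `i`'s output that traces back to input token `k`, which is the kind of token-level attribution the abstract describes. A real Transformer adds residual connections, normalization, and nonlinearities that need their own handling.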
