LG CV NE · Jul 30, 2025

Pulling Back the Curtain on ReLU Networks

arXiv:2507.22832v4 · h-index: 2

Originality: Incremental advance

AI Analysis

This work targets interpretability of deep networks. It offers a method for analyzing network gradients that may lead to mechanistic insights, though the advance is incremental because it builds on existing gradient-based techniques.

The paper addresses the problem of understanding internal representations in ReLU networks. It proposes 'excitation pullbacks': modified gradients obtained by applying soft gating in the backward pass only. These exhibit perceptual alignment on ImageNet-pretrained architectures and support interpretable feature attributions.

Since any ReLU network is piecewise affine, its hidden units can be characterized by their pullbacks through the active subnetwork, i.e., by their gradients (up to bias terms). However, gradients of deeper neurons are notoriously misaligned, which obscures the network's internal representations. We posit that models do align gradients with data, yet this is concealed by the intrinsic noise of the ReLU hard gating. We validate this intuition by applying soft gating in the backward pass only, reducing the local impact of weakly excited neurons. The resulting modified gradients, which we call "excitation pullbacks", exhibit striking perceptual alignment on a number of ImageNet-pretrained architectures, while the rudimentary pixel-space gradient ascent quickly produces easily interpretable input- and target-specific features. Inspired by these findings, we formulate the "path stability" hypothesis, claiming that the binary activation patterns largely stabilize during training and get encoded in the pre-activation distribution of the final model. When true, excitation pullbacks become aligned with the gradients of a kernel machine that mainly determines the network's decision. This provides a theoretical justification for the apparent faithfulness of the feature attributions based on excitation pullbacks, potentially even leading to mechanistic interpretability of deep models. Incidentally, we give a possible explanation for the effectiveness of Batch Normalization and Deep Features, together with a novel perspective on the network's internal memory and generalization properties. We release the code and an interactive app for easier exploration of the excitation pullbacks.
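
The backward-only soft gating described in the abstract can be illustrated with a minimal sketch. This is not the authors' released code: the logistic sigmoid gate, the `temperature` parameter, and all function names below are assumptions made for illustration, since the abstract does not specify the gating function. The forward pass is an ordinary ReLU network; only the per-unit gate used when pulling a covector back to the input changes.

```python
import math

def matvec(W, x):
    """Compute W x for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mat_t_vec(W, v):
    """Compute W^T v for a list-of-rows matrix W."""
    return [sum(W[r][c] * v[r] for r in range(len(W))) for c in range(len(W[0]))]

def hard_gate(z):
    """Derivative of ReLU: the binary activation pattern."""
    return 1.0 if z > 0 else 0.0

def soft_gate(z, temperature=1.0):
    """Assumed soft gate: a numerically stable logistic sigmoid of z / temperature."""
    s = z / temperature
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def forward(layers, x):
    """Standard ReLU forward pass; the last layer is affine with no activation.

    Returns the output and the pre-activations of every hidden layer.
    """
    pre_acts, h = [], x
    for W, b in layers[:-1]:
        z = [s + bi for s, bi in zip(matvec(W, h), b)]
        pre_acts.append(z)
        h = [max(0.0, zi) for zi in z]
    W, b = layers[-1]
    return [s + bi for s, bi in zip(matvec(W, h), b)], pre_acts

def pullback(layers, pre_acts, v, gate):
    """Pull the output covector v back to input space.

    With gate=hard_gate this is the ordinary gradient (vector-Jacobian product).
    With a soft gate it is the modified, backward-only variant.
    """
    g = mat_t_vec(layers[-1][0], v)
    for (W, _), z in zip(reversed(layers[:-1]), reversed(pre_acts)):
        g = [gi * gate(zi) for gi, zi in zip(g, z)]  # gate each hidden unit
        g = mat_t_vec(W, g)                          # move through the weights
    return g

# Toy network: 2 inputs -> 2 hidden ReLU units -> 1 output.
layers = [([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
          ([[2.0, 3.0]], [0.0])]
x = [1.0, 2.0]
out, pre_acts = forward(layers, x)
grad_hard = pullback(layers, pre_acts, [1.0], hard_gate)
grad_soft = pullback(layers, pre_acts, [1.0], soft_gate)
print(out, grad_hard, grad_soft)
```

In this sketch the forward output is identical under both gates, and as `temperature` shrinks toward zero the soft pullback approaches the ordinary gradient wherever no pre-activation is exactly zero. Whether this particular gate matches the paper's choice should be checked against the released code.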
