LG CV · Nov 8, 2022

Much Easier Said Than Done: Falsifying the Causal Relevance of Linear Decoding Methods

Lucas Hayne, Abhijit Suresh, Hunar Jain, Rahul Kumar, R. McKell Carter

arXiv:2211.04367v1 · 1.81 citations · h-index: 15

Originality: Incremental advance

AI Analysis

This work is aimed at neural network interpretability researchers and exposes limitations of current linear probing methods. It is an incremental advance because it builds on existing ablation and probing techniques.

The study examines how well linear classifier probes explain neural network function. It finds that the highly selective units identified by probes often cause no significant performance deficit when ablated, and that an interaction between selectivity and average activity better predicts ablation effects across multiple networks. The results show only a weak relationship between probe-identified units and ablation-important units; linear decoders are likely somewhat effective because the units they identify partially overlap with the causally important ones.

Linear classifier probes are frequently utilized to better understand how neural networks function. Researchers have approached the problem of determining unit importance in neural networks by probing their learned, internal representations. Linear classifier probes identify highly selective units as the most important for network function. Whether or not a network actually relies on high-selectivity units can be tested by removing them from the network using ablation. Surprisingly, when highly selective units are ablated, they produce only small performance deficits, and even then only in some cases. Despite the absence of ablation effects for selective neurons, linear decoding methods can be effectively used to interpret network function, leaving their effectiveness a mystery. To falsify the exclusive role of selectivity in network function and resolve this contradiction, we systematically ablate groups of units in subregions of activation space. Here, we find a weak relationship between neurons identified by probes and those identified by ablation. More specifically, we find that an interaction between selectivity and the average activity of the unit better predicts ablation performance deficits for groups of units in AlexNet, VGG16, MobileNetV2, and ResNet101. Linear decoders are likely somewhat effective because they overlap with those units that are causally important for network function. Interpretability methods could be improved by focusing on causally important units.
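
The comparison the abstract describes, scoring units by selectivity and then checking by ablation whether the network relies on them, can be illustrated with a minimal NumPy sketch. This is not the authors' code. The selectivity index used here, (mu_max - mu_rest) / (mu_max + mu_rest), is one common definition and is an assumption, since the abstract does not define selectivity. The function names, the fixed linear readout, and the three-unit data are hypothetical and hand-built for illustration only.

```python
import numpy as np

def class_selectivity(acts, labels, eps=1e-12):
    """Per-unit selectivity index (assumed definition, not taken from the paper).

    acts: (n_samples, n_units) non-negative activations, e.g. post-ReLU.
    Returns (mu_max - mu_rest) / (mu_max + mu_rest) for each unit.
    """
    classes = np.unique(labels)
    # Mean activation of every unit for every class: (n_classes, n_units)
    class_means = np.stack([acts[labels == c].mean(axis=0) for c in classes])
    mu_max = class_means.max(axis=0)
    # Average of the class means over all classes except the preferred one
    mu_rest = (class_means.sum(axis=0) - mu_max) / (len(classes) - 1)
    return (mu_max - mu_rest) / (mu_max + mu_rest + eps)

def ablate(acts, unit_idx):
    """Return a copy of the activations with the chosen units set to zero."""
    out = acts.copy()
    out[:, unit_idx] = 0.0
    return out

def accuracy(acts, labels, readout_w):
    """Accuracy of a fixed linear readout applied to the activations."""
    preds = (acts @ readout_w).argmax(axis=1)
    return float((preds == labels).mean())

def ablation_deficit(acts, labels, readout_w, unit_idx):
    """Drop in accuracy caused by zeroing the given units."""
    ablated = ablate(acts, unit_idx)
    return accuracy(acts, labels, readout_w) - accuracy(ablated, labels, readout_w)

# Hand-built toy data (hypothetical): 4 samples, 3 units, 2 classes.
# Units 0 and 1 both respond only to class 0; unit 1 is much less active.
# Unit 2 responds equally to both classes.
acts = np.array([
    [1.0, 0.2, 0.5],
    [1.0, 0.0, 0.5],
    [0.0, 0.0, 0.5],
    [0.0, 0.0, 0.5],
])
labels = np.array([0, 0, 1, 1])
# Fixed readout standing in for the downstream layers: (n_units, n_classes)
readout_w = np.array([
    [1.0, 0.0],
    [1.0, 0.0],
    [0.0, 0.5],
])

selectivity = class_selectivity(acts, labels)
mean_activity = acts.mean(axis=0)
deficits = [ablation_deficit(acts, labels, readout_w, [u]) for u in range(acts.shape[1])]

for u in range(acts.shape[1]):
    print(f"unit {u}: selectivity={selectivity[u]:.2f}, "
          f"mean activity={mean_activity[u]:.2f}, deficit={deficits[u]:.2f}")
```

In this toy setup units 0 and 1 are equally selective, yet only ablating the more active unit 0 changes the readout's predictions. That is the kind of selectivity-by-activity interaction the abstract refers to, though the toy says nothing about the magnitudes reported for AlexNet, VGG16, MobileNetV2, or ResNet101.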
