The Gradient-Causal Gap: Why Gradient Importance Fails on Complex Tasks
This work reveals a fundamental limitation of gradient-based methods for model interpretability and pruning, with implications for machine learning researchers and practitioners.
The paper demonstrates that gradient magnitude fails to predict causal importance in neural networks, especially on complex tasks, where removing low-gradient components can reduce out-of-distribution accuracy by 32% and high-gradient removal is unpredictably harmful.
Removing ``important'' high-gradient components from a neural network can improve generalization, while removing ``unimportant'' low-gradient components can destroy it. We demonstrate this paradox by formalizing the \textit{Gradient-Causal Gap} in Transformers trained on algorithmic tasks. While gradient magnitude and causal importance align on simple tasks ($\rho=0.73$ for reversal), this relationship collapses as task complexity increases ($\rho=0.32$ for sorting), sometimes becoming inverted ($\rho=-0.11$). Pruning experiments reveal that gradient magnitude is not merely inaccurate but \textit{unpredictably} so. Removing low-gradient ``Hidden Heroes'' consistently devastates out-of-distribution (OOD) accuracy ($-32\%$). Removing high-gradient ``Gradient Bloats'' is a coin flip: harmless in most seeds (indicating optimization noise), catastrophic in others (indicating overfitting circuits). This unpredictability means gradient-based pruning cannot reliably preserve model capabilities.
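The comparison described in the abstract can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not confirm: that $\rho$ is a Spearman rank correlation between per-component gradient magnitude and causal importance, and that components are split at the median of each score. The function names, the median thresholds, and the toy scores are hypothetical and are not the paper's method or data.

```python
from statistics import mean, median


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def split_components(grad, causal):
    """Label components using median thresholds (an assumed criterion).

    Returns (hidden_heroes, gradient_bloats) as lists of component indices:
    low gradient with high causal importance, and high gradient with low
    causal importance, respectively.
    """
    g_med, c_med = median(grad), median(causal)
    pairs = list(enumerate(zip(grad, causal)))
    heroes = [i for i, (g, c) in pairs if g < g_med and c > c_med]
    bloats = [i for i, (g, c) in pairs if g > g_med and c < c_med]
    return heroes, bloats


if __name__ == "__main__":
    # Toy scores for six hypothetical components (not from the paper).
    grad_magnitude = [0.9, 0.1, 0.5, 0.2, 0.8, 0.3]
    causal_importance = [0.05, 0.6, 0.4, 0.7, 0.5, 0.1]
    print("rho:", spearman_rho(grad_magnitude, causal_importance))
    print("heroes, bloats:", split_components(grad_magnitude, causal_importance))
```

In practice the causal importance score would come from an ablation measurement, such as the change in OOD accuracy when a component is removed, while the gradient score would come from backpropagation on the training loss. The sketch only shows how the two rankings could be compared once both scores are available.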