Investigating Neuron Ablation in Attention Heads: The Case for Peak Activation Centering
This work addresses interpretability challenges in transformer models and is aimed at researchers and practitioners. Its contribution appears incremental, since it compares three existing ablation techniques with one new one.
The paper compares four neuron ablation methods (zero, mean, resampling, and a novel "peak" ablation) in the attention heads of transformer models to understand how attention mechanisms represent concepts. It finds that each method can yield the lowest performance degradation in some regime, while resampling typically causes the most deterioration.
The use of transformer-based models is growing rapidly throughout society. With this growth, it is important to understand how they work and, in particular, how their attention mechanisms represent concepts. Many interpretability methods exist, but a large share of them examine models through their neuron activations, which are poorly understood. We describe different lenses through which to view neuron activations and investigate their effectiveness in language models and vision transformers using four methods of neuron ablation: zero ablation, mean ablation, activation resampling, and a novel approach we term 'peak ablation'. Through experimental analysis, we find that, depending on the regime and the model, each method can cause the lowest degradation of model performance relative to the others, with resampling usually causing the most significant deterioration. We make our code available at https://github.com/nickypro/investigating-ablation.
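To make the four ablation methods concrete, here is a minimal Python sketch that operates on a plain list-of-lists activation matrix (rows are samples, columns are neurons). It is not taken from the linked repository, and all function names are hypothetical. The abstract does not define peak ablation, so the sketch assumes it means replacing a neuron's activation with the modal value of its activation distribution, estimated here from a simple histogram.

```python
import random
from statistics import mean


def _replace_column(acts, neuron, value_fn):
    """Return a copy of acts where column `neuron` is replaced by value_fn()."""
    return [
        [value_fn() if j == neuron else a for j, a in enumerate(row)]
        for row in acts
    ]


def zero_ablate(acts, neuron):
    """Zero ablation: set the neuron's activation to 0 for every sample."""
    return _replace_column(acts, neuron, lambda: 0.0)


def mean_ablate(acts, neuron):
    """Mean ablation: set the neuron's activation to its mean over samples."""
    mu = mean(row[neuron] for row in acts)
    return _replace_column(acts, neuron, lambda: mu)


def resample_ablate(acts, neuron, rng):
    """Resampling: replace each activation with one drawn from another sample."""
    pool = [row[neuron] for row in acts]
    return _replace_column(acts, neuron, lambda: rng.choice(pool))


def peak_value(values, n_bins=20):
    """Estimate the histogram peak (mode) as the centre of the fullest bin.

    ASSUMPTION: this is one plausible reading of 'peak'; the abstract
    does not specify how the peak is computed.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp so the maximum value falls in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    best = counts.index(max(counts))
    return lo + (best + 0.5) * width


def peak_ablate(acts, neuron, n_bins=20):
    """Peak ablation (assumed): set the neuron to its modal activation."""
    peak = peak_value([row[neuron] for row in acts], n_bins)
    return _replace_column(acts, neuron, lambda: peak)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy activations: 5 samples, 2 neurons. Neuron 0 sits near 1.0
    # most of the time, with one large outlier.
    acts = [[0.0, 2.0], [1.0, 4.0], [1.0, 6.0], [1.0, 8.0], [10.0, 9.0]]
    print("zero    ", zero_ablate(acts, 0))
    print("mean    ", mean_ablate(acts, 0))
    print("resample", resample_ablate(acts, 0, rng))
    print("peak    ", peak_ablate(acts, 0, n_bins=10))
```

In the toy data, the outlier pulls the mean of neuron 0 well away from where most of its activations lie, while the histogram peak stays close to them. This illustrates why the choice of replacement value can matter for how much an ablation perturbs the model. In a real model these replacements would be applied to attention-head neuron activations during the forward pass rather than to a static list.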