Removing Spurious Correlation from Neural Network Interpretations
This work addresses misleading interpretations of AI systems, a concern for both researchers and practitioners, but its contribution is incremental because it builds on existing causal mediation methods.
The paper tackles spurious correlations in neural network interpretations that arise from confounders such as conversation topic, and shows that adjusting for the topic effect makes toxicity appear less localized in large language models.
Existing algorithms for identifying the neurons responsible for undesired and harmful behaviors do not consider the effects of confounders such as the topic of the conversation. In this work, we show that confounders can create spurious correlations, and we propose a new causal mediation approach that controls for the impact of the topic. In experiments with two large language models, we study the localization hypothesis and show that, after adjusting for the effect of conversation topic, toxicity becomes less localized.
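To make the confounding argument concrete, the following is a minimal sketch on synthetic data, not the paper's causal mediation procedure. It assumes a hypothetical binary topic label that drives both a neuron's activation and a toxicity score, and it swaps in a much simpler technique (subtracting per-topic means before correlating) to show how a topic-induced correlation can vanish once topic is controlled for. All names, coefficients, and data here are illustrative assumptions.

```python
import numpy as np

def residualize(values, topic):
    """Subtract each topic's mean, removing variation explained by topic."""
    values = np.asarray(values, dtype=float)
    topic = np.asarray(topic)
    out = values.copy()
    for t in np.unique(topic):
        mask = topic == t
        out[mask] -= values[mask].mean()
    return out

def raw_and_adjusted_corr(activation, toxicity, topic):
    """Return (raw correlation, correlation after removing topic means)."""
    raw = np.corrcoef(activation, toxicity)[0, 1]
    adjusted = np.corrcoef(
        residualize(activation, topic),
        residualize(toxicity, topic),
    )[0, 1]
    return raw, adjusted

# Synthetic, assumed data: the neuron and the toxicity score both depend
# on topic, but have no direct dependence on each other.
rng = np.random.default_rng(0)
n = 5000
topic = rng.integers(0, 2, size=n)              # hypothetical topic label
activation = 2.0 * topic + rng.normal(size=n)   # neuron tracks topic only
toxicity = 1.5 * topic + rng.normal(size=n)     # toxicity tracks topic only

raw, adjusted = raw_and_adjusted_corr(activation, toxicity, topic)
print(f"raw correlation:      {raw:.3f}")
print(f"adjusted correlation: {adjusted:.3f}")
```

In this toy setup the raw correlation is clearly positive even though the neuron has no direct link to toxicity, while the topic-adjusted correlation is close to zero. A neuron flagged by the raw statistic would therefore be a spurious attribution, which is the kind of effect the abstract attributes to unadjusted localization methods.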