CL | AI | LG | AP | ME | Dec 3, 2024

Removing Spurious Correlation from Neural Network Interpretations

arXiv:2412.02893v1 | 1 citation | h-index: 22
Originality: Incremental advance
AI Analysis

The paper addresses misleading interpretations of AI systems, a concern for both researchers and practitioners, but the advance is incremental because it builds on existing causal mediation methods.

The paper tackles spurious correlations in neural network interpretations caused by confounders such as conversation topic, and shows that, after adjusting for the effect of topic, toxicity appears less localized in the two large language models studied.

Existing algorithms for identifying neurons responsible for undesired and harmful behaviors do not consider the effects of confounders such as the topic of the conversation. In this work, we show that confounders can create spurious correlations and propose a new causal mediation approach that controls for the impact of the topic. In experiments with two large language models, we study the localization hypothesis and show that, after adjusting for the effect of conversation topic, toxicity becomes less localized.
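To make the confounding argument concrete, here is a minimal sketch on synthetic data. It is not the paper's causal mediation procedure: it swaps in a simple linear residualization (partial correlation) as a stand-in, and every name and coefficient (`topic`, `activation`, `toxicity`, `residualize`, the effect sizes) is a hypothetical chosen for illustration. The neuron activation and the toxicity score are both driven by topic and have no direct link to each other, so any raw correlation between them is spurious.

```python
import numpy as np


def residualize(x, confounder):
    """Return x minus its best linear fit on the confounder (with intercept)."""
    design = np.column_stack([np.ones_like(confounder), confounder])
    coef, *_ = np.linalg.lstsq(design, x, rcond=None)
    return x - design @ coef


def raw_and_adjusted_corr(activation, toxicity, topic):
    """Correlation before and after linearly removing the topic effect."""
    raw = np.corrcoef(activation, toxicity)[0, 1]
    adjusted = np.corrcoef(
        residualize(activation, topic),
        residualize(toxicity, topic),
    )[0, 1]
    return raw, adjusted


# Synthetic data (assumption): topic drives both variables independently.
rng = np.random.default_rng(0)
n = 5000
topic = rng.integers(0, 2, size=n).astype(float)  # hypothetical binary topic
activation = 2.0 * topic + rng.normal(size=n)     # neuron responds to topic
toxicity = 1.5 * topic + rng.normal(size=n)       # toxicity depends on topic

raw, adjusted = raw_and_adjusted_corr(activation, toxicity, topic)
print(f"raw correlation:      {raw:.3f}")
print(f"adjusted correlation: {adjusted:.3f}")
```

In this toy setup the raw correlation is clearly positive while the topic-adjusted one is close to zero, which is the kind of pattern that could lead an unadjusted analysis to flag a neuron that only tracks topic.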

