LG · CL · May 9, 2025

Short-circuiting Shortcuts: Mechanistic Investigation of Shortcuts in Text Classification

arXiv:2505.06032v1 · 4 citations · h-index: 6 · Has Code · Proceedings of the 29th Conference on Computational Natural Language Learning
Originality: Incremental advance
AI Analysis

This paper addresses unreliable predictions in text classification with language models by providing a targeted mitigation approach. The advance is incremental, since it builds on existing interpretability methods.

The paper investigates how shortcuts (spurious correlations) are processed within language models' decision-making mechanisms. It identifies specific attention heads that cause premature decisions and introduces Head-based Token Attribution (HTA) to detect and mitigate these shortcuts.

Reliance on spurious correlations (shortcuts) has been shown to underlie many of the successes of language models. Previous work focused on identifying the input elements that impact prediction. We investigate how shortcuts are actually processed within the model's decision-making mechanism. We use actor names in movie reviews as controllable shortcuts with known impact on the outcome. We use mechanistic interpretability methods and identify specific attention heads that focus on shortcuts. These heads gear the model towards a label before processing the complete input, effectively making premature decisions that bypass contextual analysis. Based on these findings, we introduce Head-based Token Attribution (HTA), which traces intermediate decisions back to input tokens. We show that HTA is effective in detecting shortcuts in LLMs and enables targeted mitigation by selectively deactivating shortcut-related attention heads.
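To make "selectively deactivating attention heads" concrete, here is a minimal sketch of zero-ablation in a toy multi-head attention layer written in pure Python. It is not the paper's HTA implementation, and the abstract does not say which ablation variant the authors use. All names (`head_output`, `attention_layer`, `head_mask`), the random weights, and the dimensions are hypothetical and chosen only for illustration. The sketch relies on the fact that a layer's output is the sum of per-head contributions, so dropping a head removes exactly that head's contribution.

```python
import math
import random

def matmul(a, b):
    # Plain matrix product of two lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    # Numerically stable softmax applied to each row.
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def head_output(x, w_q, w_k, w_v, w_o):
    # Contribution of one attention head to the layer output.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_head = len(w_q[0])
    scores = [[s / math.sqrt(d_head) for s in row] for row in matmul(q, transpose(k))]
    return matmul(matmul(softmax_rows(scores), v), w_o)

def attention_layer(x, heads, head_mask):
    # Sum the contributions of all heads whose mask entry is True.
    # A False entry zero-ablates ("deactivates") that head.
    out = [[0.0] * len(x[0]) for _ in x]
    for (w_q, w_k, w_v, w_o), keep in zip(heads, head_mask):
        if not keep:
            continue
        contrib = head_output(x, w_q, w_k, w_v, w_o)
        out = [[o + c for o, c in zip(o_row, c_row)] for o_row, c_row in zip(out, contrib)]
    return out

def rand_matrix(rng, rows, cols):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

# Hypothetical toy setup: 4 tokens, model width 6, 2 heads of width 3.
rng = random.Random(0)
seq_len, d_model, d_head, n_heads = 4, 6, 3, 2
x = rand_matrix(rng, seq_len, d_model)
heads = [
    (
        rand_matrix(rng, d_model, d_head),  # w_q
        rand_matrix(rng, d_model, d_head),  # w_k
        rand_matrix(rng, d_model, d_head),  # w_v
        rand_matrix(rng, d_head, d_model),  # w_o
    )
    for _ in range(n_heads)
]

full = attention_layer(x, heads, [True, True])
without_head1 = attention_layer(x, heads, [True, False])  # head 1 deactivated
only_head1 = attention_layer(x, heads, [False, True])
all_off = attention_layer(x, heads, [False, False])
```

In this toy setting, `full` equals `without_head1` plus `only_head1` elementwise, which is why a head suspected of carrying a shortcut signal can be switched off without touching the other heads' computation within the same layer. In a real model, later layers read the changed output, so the downstream effect still has to be measured empirically.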

Code Implementations: 1 repo
