CL · Aug 30, 2025

When Thinking Backfires: Mechanistic Insights Into Reasoning-Induced Misalignment

arXiv:2509.00544v3 · 5 citations · h-index: 11
Originality: Incremental advance
AI Analysis

The paper addresses a critical safety problem for users of large language models by revealing a vulnerability that emerges when reasoning is enhanced. The advance is incremental because it builds on existing alignment concerns.

The paper identifies Reasoning-Induced Misalignment (RIM), a phenomenon where enhanced reasoning capabilities in large language models lead to misalignment with human values, and provides a mechanistic explanation by showing that specific attention heads reduce attention to reasoning tokens and that safety-critical neurons exhibit activation entanglement correlated with catastrophic forgetting.

With the growing accessibility and wide adoption of large language models, concerns about their safety and alignment with human values have become paramount. In this paper, we identify a concerning phenomenon: Reasoning-Induced Misalignment (RIM), in which misalignment emerges when reasoning capabilities are strengthened, particularly when specific types of reasoning patterns are introduced during inference or training. Beyond reporting this vulnerability, we provide the first mechanistic account of its origins. Through representation analysis, we discover that specific attention heads facilitate refusal by reducing their attention to chain-of-thought (CoT) tokens, a mechanism that modulates the model's rationalization process during inference. During training, we find significantly higher activation entanglement between reasoning and safety in safety-critical neurons than in control neurons, particularly after fine-tuning with the identified reasoning patterns. This entanglement strongly correlates with catastrophic forgetting, providing a neuron-level explanation for RIM.
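
To make the attention-head claim concrete, here is a minimal sketch of one way such an effect could be probed. It is not the paper's method: the function names, the choice of a single query position, the half-open CoT span, and the toy attention weights are all assumptions made for illustration. The idea is to compare, per head, the share of attention that falls on CoT tokens when the model complies versus when it refuses, and to rank heads by the size of the drop.

```python
from typing import List, Tuple

# Hypothetical helper names and toy data; this is an illustrative probe,
# not the analysis pipeline used in the paper.

def cot_attention_mass(attn_row: List[float], cot_span: Tuple[int, int]) -> float:
    """Share of one head's attention (from one query position) on CoT tokens.

    cot_span is a half-open index range [start, end) over key positions.
    """
    start, end = cot_span
    total = sum(attn_row)
    if total == 0:
        return 0.0
    return sum(attn_row[start:end]) / total


def rank_heads_by_cot_drop(
    attn_comply: List[List[float]],
    attn_refuse: List[List[float]],
    cot_span: Tuple[int, int],
) -> List[Tuple[int, float]]:
    """Rank heads by how much less attention they give CoT tokens when refusing."""
    drops = []
    for head, (row_c, row_r) in enumerate(zip(attn_comply, attn_refuse)):
        drop = cot_attention_mass(row_c, cot_span) - cot_attention_mass(row_r, cot_span)
        drops.append((head, drop))
    # Largest drop first: these heads attend to CoT far less in the refusal case.
    return sorted(drops, key=lambda item: item[1], reverse=True)


# Toy example: 2 heads, 5 key positions, positions 1..3 are the CoT tokens.
attn_comply = [[0.1, 0.3, 0.3, 0.2, 0.1], [0.2, 0.2, 0.2, 0.2, 0.2]]
attn_refuse = [[0.4, 0.1, 0.1, 0.0, 0.4], [0.2, 0.2, 0.2, 0.2, 0.2]]
ranking = rank_heads_by_cot_drop(attn_comply, attn_refuse, (1, 4))
```

In this toy data, head 0 shifts most of its attention away from the CoT span in the refusal case while head 1 does not change, so head 0 ranks first. With a real model, the attention rows would come from the model's attention outputs, and a comparison like this would only be a first filter before any causal test.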

