LG · May 12

Targeted Neuron Modulation via Contrastive Pair Search

arXiv:2605.12290 · 60.8
AI Analysis

For researchers and practitioners working on AI safety and interpretability, this work provides a method for targeted behavioral steering in language models with minimal quality tradeoffs, though the findings are incremental as they build on existing neuron attribution techniques.

The paper introduces contrastive neuron attribution (CNA), which identifies the 0.1% of MLP neurons whose activations most distinguish harmful from benign prompts, and shows that ablating these neurons in instruct models reduces refusal rates by over 50% on a standard jailbreak benchmark without degrading output coherence. Base models contain similar late-layer discrimination structures, but steering those neurons produces only content shifts, not behavioral change, which suggests that alignment fine-tuning turns this pre-existing structure into a sparse refusal gate.

Language models are instruction-tuned to refuse harmful requests, but the mechanisms underlying this behavior remain poorly understood. Popular steering methods operate on the residual stream and degrade output coherence at high intervention strengths, limiting their practical use. We introduce contrastive neuron attribution (CNA), which identifies the 0.1% of MLP neurons whose activations most distinguish harmful from benign prompts, requiring only forward passes with no gradients or auxiliary training. In instruct models, ablating the discovered circuit reduces refusal rates by over 50% on a standard jailbreak benchmark while preserving fluency and non-degeneracy across all steering strengths. Applying CNA to matched base and instruct models across Llama and Qwen architectures (from 1B to 72B parameters), we find that base models contain similar late-layer discrimination structures but steering these neurons produces only content shifts, not behavioral change. These results demonstrate that neuron-level intervention enables reliable behavioral steering without the quality tradeoffs of residual-stream methods. More broadly, our findings suggest that alignment fine-tuning transforms pre-existing discrimination structure into a sparse, targetable refusal gate.
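The selection step described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: neurons are scored by the absolute difference in mean activation between harmful and benign prompts, ablation means setting the selected activations to zero, and the activations are synthetic NumPy arrays rather than values captured from a real model. The function names (`contrastive_neuron_scores`, `top_fraction_indices`, `ablate`) are hypothetical and do not come from the paper.

```python
import numpy as np


def contrastive_neuron_scores(harmful_acts, benign_acts):
    """Score each neuron by |mean harmful activation - mean benign activation|.

    Both inputs have shape (n_prompts, n_neurons). The scoring rule is an
    assumption made for illustration, not the paper's confirmed definition.
    """
    return np.abs(harmful_acts.mean(axis=0) - benign_acts.mean(axis=0))


def top_fraction_indices(scores, fraction=0.001):
    """Return the sorted indices of the top `fraction` of neurons by score."""
    k = max(1, int(round(fraction * scores.size)))
    # argpartition places the k largest scores in the last k slots (unordered).
    return np.sort(np.argpartition(scores, -k)[-k:])


def ablate(acts, neuron_idx):
    """Return a copy of the activations with the selected neurons zeroed."""
    out = acts.copy()
    out[:, neuron_idx] = 0.0
    return out


# Synthetic stand-in for MLP activations collected with forward passes only.
rng = np.random.default_rng(0)
n_neurons = 10_000
benign = rng.normal(size=(64, n_neurons))
harmful = rng.normal(size=(64, n_neurons))

# Plant three neurons that respond strongly to the "harmful" prompts.
planted = np.array([7, 1234, 9999])
harmful[:, planted] += 5.0

scores = contrastive_neuron_scores(harmful, benign)
selected = top_fraction_indices(scores, fraction=0.001)  # 0.1% of 10,000
ablated = ablate(harmful, selected)
```

With a real model, the two activation matrices would come from hooks on the MLP layers during ordinary forward passes, which matches the abstract's claim that no gradients or auxiliary training are needed. The zeroing step would then be applied inside the forward pass rather than to a stored array.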
