CL LG
Nov 7, 2025

Steering Language Models with Weight Arithmetic

arXiv:2511.05408v1
7 citations
h-index: 2

Originality: Incremental advance

AI Analysis

This work addresses how to use narrow training data to control model behavior without unintended generalization, an incremental advance in post-training techniques for AI safety.

The authors address unintended behaviors in large language models with contrastive weight steering, a post-training method that uses weight arithmetic to isolate a behavior direction in weight space and then add or remove it from the model's parameters. They apply it to mitigate sycophancy and to induce misalignment, and report stronger out-of-distribution control than activation steering. Applied after task-specific fine-tuning, it partially reduces the sycophancy and under-refusals that fine-tuning introduced while preserving task performance gains.
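
The weight arithmetic described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each model is represented as a dictionary mapping parameter names to flat lists of floats, and the names `behavior_direction`, `steer`, and the scale `alpha` are hypothetical. The paper may restrict the edit to particular layers or scale it differently.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flattened parameter values.
Weights = Dict[str, List[float]]


def behavior_direction(base: Weights, positive: Weights, negative: Weights) -> Weights:
    """Subtract the two fine-tuning deltas to isolate a behavior direction.

    `positive` is a fine-tune that induces the behavior, `negative` one that
    induces its opposite; both are assumed to start from `base`.
    """
    direction: Weights = {}
    for name in base:
        delta_pos = [p - b for p, b in zip(positive[name], base[name])]
        delta_neg = [n - b for n, b in zip(negative[name], base[name])]
        direction[name] = [dp - dn for dp, dn in zip(delta_pos, delta_neg)]
    return direction


def steer(weights: Weights, direction: Weights, alpha: float) -> Weights:
    """Add (alpha > 0) or remove (alpha < 0) the direction from the weights."""
    return {
        name: [w + alpha * d for w, d in zip(weights[name], direction[name])]
        for name in weights
    }


if __name__ == "__main__":
    base = {"w": [1.0, 2.0]}
    positive = {"w": [1.5, 2.0]}
    negative = {"w": [0.5, 2.5]}
    direction = behavior_direction(base, positive, negative)
    print(steer(base, direction, alpha=2.0))
```

Because both deltas are taken from the same base, the base weights cancel and the direction reduces to the difference between the two fine-tunes. The sketch keeps the deltas explicit to mirror the description.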

Providing high-quality feedback to Large Language Models (LLMs) on a diverse training distribution can be difficult and expensive, and providing feedback only on a narrow distribution can result in unintended generalizations. To better leverage narrow training data, we propose contrastive weight steering, a simple post-training method that edits the model parameters using weight arithmetic. We isolate a behavior direction in weight-space by subtracting the weight deltas from two small fine-tunes -- one that induces the desired behavior and another that induces its opposite -- and then add or remove this direction to modify the model's weights. We apply this technique to mitigate sycophancy and induce misalignment, and find that weight steering often generalizes further than activation steering, achieving stronger out-of-distribution behavioral control before degrading general capabilities. We also show that, in the context of task-specific fine-tuning, weight steering can partially mitigate undesired behavioral drift: it can reduce sycophancy and under-refusals introduced during fine-tuning while preserving task performance gains. Finally, we provide preliminary evidence that emergent misalignment can be detected by measuring the similarity between fine-tuning updates and an "evil" weight direction, suggesting that it may be possible to monitor the evolution of weights during training and detect rare misaligned behaviors that never manifest during training or evaluations.
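
The monitoring idea in the abstract's final sentence can be illustrated with a small sketch. It assumes cosine similarity as the similarity measure and a dictionary-of-lists weight representation. Both are assumptions made here for illustration, and the function names `flatten_update` and `cosine_similarity` are hypothetical. The abstract does not specify the exact metric.

```python
import math
from typing import Dict, List

# Hypothetical representation: parameter name -> flattened parameter values.
Weights = Dict[str, List[float]]


def flatten_update(before: Weights, after: Weights) -> List[float]:
    """Concatenate the per-parameter fine-tuning update (after - before)."""
    update: List[float] = []
    for name in sorted(before):  # fixed order so vectors line up
        update.extend(a - b for a, b in zip(after[name], before[name]))
    return update


def flatten(direction: Weights) -> List[float]:
    """Concatenate a direction's parameters in the same fixed order."""
    flat: List[float] = []
    for name in sorted(direction):
        flat.extend(direction[name])
    return flat


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is zero."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    before = {"w": [0.0, 0.0]}
    after = {"w": [1.0, 0.0]}          # weights after some fine-tuning steps
    evil_direction = {"w": [2.0, 0.0]}  # assumed to be computed beforehand
    score = cosine_similarity(flatten_update(before, after), flatten(evil_direction))
    print(score)
```

In a training loop, one might compute this score at each checkpoint and flag runs where it rises. Any threshold would need to be chosen empirically.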
