CL LG Jun 21, 2024

Steering Without Side Effects: Improving Post-Deployment Control of Language Models

Asa Cooper Stickland, Alexander Lyzhov, Jacob Pfau, Salsabila Mahdi, Samuel R. Bowman

arXiv:2406.15518v117.946 citations Has Code

Originality: Incremental advance

AI Analysis

The paper addresses targeted mitigation of worst-case behavior in language models without frequent retraining. The advance is incremental because it builds on existing steering vector methods.

The paper tackles the problem of language models behaving unexpectedly post-deployment, for example under jailbreak attacks. It introduces KL-then-steer (KTS), a method that reduces the side effects of steering vectors while retaining their benefits. The best variant prevents 44% of jailbreak attacks compared to the original Llama-2-chat-7B model while keeping helpfulness on benign requests almost on par with the original model.

Language models (LMs) have been shown to behave unexpectedly post-deployment. For example, new jailbreaks continually arise, allowing model misuse, despite extensive red-teaming and adversarial training from developers. Given that most model queries are unproblematic and frequent retraining results in an unstable user experience, methods for mitigation of worst-case behavior should be targeted. One such method is classifying inputs as potentially problematic, then selectively applying steering vectors on these problematic inputs, i.e. adding particular vectors to model hidden states. However, steering vectors can also negatively affect model performance, which will be an issue in cases where the classifier was incorrect. We present KL-then-steer (KTS), a technique that decreases the side effects of steering while retaining its benefits, by first training a model to minimize Kullback-Leibler (KL) divergence between a steered and unsteered model on benign inputs, then steering the model that has undergone this training. Our best method prevents 44% of jailbreak attacks compared to the original Llama-2-chat-7B model while maintaining helpfulness (as measured by MT-Bench) on benign requests almost on par with the original LM. To demonstrate the generality and transferability of our method beyond jailbreaks, we show that our KTS model can be steered to reduce bias towards user-suggested answers on TruthfulQA. Code is available: https://github.com/AsaCooperStickland/kl-then-steer.
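The two ingredients named in the abstract, classifier-gated steering and a KL term between steered and unsteered outputs, can be illustrated with a toy sketch. This is not the authors' implementation. The linear readout `unembed`, the `flagged` classifier decision, the `scale` parameter, and all numeric values are hypothetical, and the KL direction used here (unsteered relative to steered) is an assumption, since the abstract does not state it. In KTS the analogous quantity would serve as a training loss on benign inputs; here it is only measured.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def readout(hidden, unembed):
    """Toy linear readout (hypothetical): one logit per row of `unembed`."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in unembed]

def steer(hidden, vector, scale=1.0):
    """Steering as described in the abstract: add a vector to a hidden state."""
    return [h + scale * v for h, v in zip(hidden, vector)]

def next_token_probs(hidden, unembed, vector=None, flagged=False, scale=1.0):
    """Steer only when a (hypothetical) input classifier flagged the query."""
    if flagged and vector is not None:
        hidden = steer(hidden, vector, scale)
    return softmax(readout(hidden, unembed))

# Toy values, chosen only for illustration.
hidden = [0.5, -1.0, 2.0]
steering_vector = [1.0, 0.0, -1.0]
unembed = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.5],
]

unsteered = next_token_probs(hidden, unembed)
steered = next_token_probs(hidden, unembed, steering_vector, flagged=True)

# Side effect of steering on this input, measured as a KL divergence.
# KTS would train the model so that this stays small on benign inputs.
side_effect = kl_divergence(unsteered, steered)
```

A misclassified benign query corresponds to calling `next_token_probs` with `flagged=True` on an input that did not need steering; `side_effect` quantifies how far the output distribution moves in that case.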
