CL · Sep 14, 2023

Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions

Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Röttger, Dan Jurafsky, Tatsunori Hashimoto, James Zou

Stanford

arXiv:2309.07875v330.2398 citations · h-index: 109 · Has Code

Originality: Incremental advance

AI Analysis

This paper tackles harmful content generation in AI assistants, a concern for both users and developers. The advance is incremental because it builds on existing fine-tuning methods.

The paper addresses the safety risks of instruction-tuned large language models, showing that popular models are highly unsafe, and demonstrates that adding just 3% safety examples during fine-tuning can substantially improve safety without significantly reducing helpfulness on benchmarks.

Training large language models to follow instructions makes them perform better on a wide range of tasks and generally become more helpful. However, a perfectly helpful model will follow even the most malicious instructions and readily generate harmful content. In this paper, we raise concerns over the safety of models that only emphasize helpfulness, not harmlessness, in their instruction-tuning. We show that several popular instruction-tuned models are highly unsafe. Moreover, we show that adding just 3% safety examples (a few hundred demonstrations) when fine-tuning a model like LLaMA can substantially improve its safety. Our safety-tuning does not make models significantly less capable or helpful as measured by standard benchmarks. However, we do find exaggerated safety behaviours, where too much safety-tuning makes models refuse perfectly safe prompts if they superficially resemble unsafe ones. As a whole, our results illustrate trade-offs in training LLMs to be helpful and training them to be safe.
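The data-mixing idea in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `mix_safety_examples`, the record fields, and the toy data are all hypothetical, and the sketch assumes the 3% is measured relative to the size of the general instruction set, which the abstract does not specify.

```python
import random


def mix_safety_examples(instruction_data, safety_pool, safety_fraction=0.03, seed=0):
    """Return the instruction data plus a sample of safety examples, shuffled.

    Assumption: safety_fraction is relative to len(instruction_data).
    """
    rng = random.Random(seed)  # local RNG so the mix is reproducible
    n_safety = round(safety_fraction * len(instruction_data))
    if n_safety > len(safety_pool):
        raise ValueError("safety_pool has fewer examples than requested")
    sampled = rng.sample(safety_pool, n_safety)  # sample without replacement
    mixed = list(instruction_data) + sampled
    rng.shuffle(mixed)  # interleave safety and general examples
    return mixed


# Toy placeholder records; real data would be instruction/response pairs.
general = [
    {"instruction": f"task {i}", "output": f"answer {i}", "kind": "general"}
    for i in range(10_000)
]
safety = [
    {"instruction": f"unsafe request {i}", "output": "refusal", "kind": "safety"}
    for i in range(2_000)
]

train_set = mix_safety_examples(general, safety)
n_safety = sum(ex["kind"] == "safety" for ex in train_set)
print(len(train_set), n_safety)
```

With 10,000 general examples, a 3% fraction yields 300 safety demonstrations, in line with the "few hundred" the abstract mentions. Raising `safety_fraction` is the knob that, per the abstract, risks exaggerated refusals of safe prompts.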
