LookAhead Tuning: Safer Language Models via Partial Answer Previews
This work addresses the loss of safety alignment that users face when adapting LLMs to specific domains, though the contribution is incremental because it builds on existing fine-tuning approaches.
The paper tackles safety degradation in large language models during fine-tuning. It introduces LookAhead Tuning, a lightweight data-driven method that preserves safety by previewing partial answer prefixes in the training data, and it reports that safety is maintained without sacrificing downstream task performance.
Fine-tuning enables large language models (LLMs) to adapt to specific domains, but often compromises their previously established safety alignment. To mitigate the degradation of model safety during fine-tuning, we introduce LookAhead Tuning, a lightweight and effective data-driven approach that preserves safety during fine-tuning. The method introduces two simple strategies that modify training data by previewing partial answer prefixes, thereby minimizing perturbations to the model's initial token distributions and maintaining its built-in safety mechanisms. Comprehensive experiments demonstrate that LookAhead Tuning effectively maintains model safety without sacrificing robust performance on downstream tasks. Our findings position LookAhead Tuning as a reliable and efficient solution for the safe and effective adaptation of LLMs.
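The abstract describes modifying training data so that a partial answer prefix is previewed. The sketch below is one possible reading of that idea, not the authors' implementation: it appends the first few words of each answer to the instruction and leaves the answer unchanged. The names `Example` and `preview_answer_prefix`, the template wording, and the use of whitespace-separated words in place of model tokens are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One supervised fine-tuning pair (hypothetical container)."""
    instruction: str
    answer: str


def preview_answer_prefix(
    example: Example,
    num_words: int = 6,
    template: str = " The answer begins with: ",
) -> Example:
    """Return a copy of `example` whose instruction previews the answer's prefix.

    Assumption: words split on whitespace stand in for tokenizer tokens.
    The template text is illustrative, not taken from the paper.
    """
    words = example.answer.split()
    prefix = " ".join(words[: max(num_words, 0)])
    if not prefix:
        # Nothing to preview (empty answer or num_words <= 0): leave unchanged.
        return example
    return Example(
        instruction=example.instruction + template + prefix,
        answer=example.answer,  # the training target itself is not modified
    )


if __name__ == "__main__":
    original = Example(
        instruction="Summarize the report.",
        answer="The report covers three topics in detail.",
    )
    modified = preview_answer_prefix(original, num_words=3)
    print(modified.instruction)
    print(modified.answer)
```

Because the previewed words already appear in the prompt, the model has less to learn about how its response should begin, which is the intuition the abstract gives for limiting perturbations to the initial token distributions. A real pipeline would likely count prefix length in tokenizer tokens rather than words.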