CL LG | Jun 12, 2024

Do as I do (Safely): Mitigating Task-Specific Fine-tuning Risks in Large Language Models

Francisco Eiras, Aleksandar Petrov, Philip H. S. Torr, M. Pawan Kumar, Adel Bibi

arXiv:2406.10288v3 | 10.0 | 16 citations

Originality: Highly original

AI Analysis

This paper addresses a critical safety issue for users of fine-tuned LLMs in specialized domains, and its approach is novel rather than an incremental improvement.

The paper tackles the safety risks of task-specific fine-tuning of large language models, where malicious manipulation of a dataset can increase dangerous model behaviors. It proposes a mitigation strategy that mixes in safety data to re-establish safety alignment while maintaining task performance, and reports that this is significantly more effective than existing baselines.

Recent research shows that fine-tuning on benign instruction-following data can inadvertently undo the safety alignment process and increase a model's propensity to comply with harmful queries. While instruction-following fine-tuning is important, task-specific fine-tuning - where models are trained on datasets with clear ground truth answers (e.g., multiple choice questions) - can enhance model performance on specialized downstream tasks. Understanding and mitigating safety risks in the task-specific setting remains distinct from the instruction-following context due to structural differences in the data. Our work demonstrates how malicious actors can subtly manipulate the structure of almost any task-specific dataset to foster significantly more dangerous model behaviors, while maintaining an appearance of innocuity and reasonable downstream task performance. To address this issue, we propose a novel mitigation strategy that mixes in safety data which mimics the task format and prompting style of the user data, showing this is significantly more effective and efficient than existing baselines at re-establishing safety alignment while maintaining similar task performance.
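The mitigation described in the abstract, mixing in safety data that mimics the task format and prompting style of the user data, can be illustrated with a minimal sketch. The abstract does not give the procedure's details, so everything here is an assumption made for illustration: the function names `format_like_task` and `mix_in_safety_data`, the `safety_fraction` parameter, the prompt template, and the placeholder safety pairs are hypothetical and are not the authors' implementation.

```python
import random


def format_like_task(question, answer, template):
    """Render a (question, answer) pair using the same prompt template as the task data."""
    return {"prompt": template.format(question=question), "completion": answer}


def mix_in_safety_data(task_examples, safety_pairs, template,
                       safety_fraction=0.1, seed=0):
    """Return the task examples plus safety examples rendered in the task format.

    safety_fraction (a hypothetical knob) is the number of safety examples to add,
    expressed as a fraction of the task set size and capped by the pairs available.
    """
    rng = random.Random(seed)
    n_safety = min(len(safety_pairs), round(safety_fraction * len(task_examples)))
    chosen = rng.sample(safety_pairs, n_safety)
    safety_examples = [format_like_task(q, a, template) for q, a in chosen]
    mixed = list(task_examples) + safety_examples
    rng.shuffle(mixed)  # interleave safety examples with task examples
    return mixed


# Illustrative usage with placeholder data (assumed, not taken from the paper).
TEMPLATE = "Question: {question}\nAnswer:"
REFUSAL = "I can't help with that request."

task_examples = [
    format_like_task(f"Task question {i}?", "B", TEMPLATE) for i in range(8)
]
safety_pairs = [
    (f"<harmful request placeholder {i}>", REFUSAL) for i in range(3)
]

mixed = mix_in_safety_data(task_examples, safety_pairs, TEMPLATE,
                           safety_fraction=0.25)
print(len(mixed), "training examples after mixing")
```

The point of the sketch is that the safety examples pass through the same template as the task examples, so the fine-tuned model sees refusals in the prompting style it will be queried with; how many safety examples to add and how they are sourced would need to follow the paper itself.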
