CL AI LGSep 18, 2024

Multitask Mayhem: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning

Essa Jan, Nouar AlDahoul, Moiz Ali, Faizan Ahmad, Fareed Zaffar, Yasir Zaki

arXiv:2409.15361v14.86 citationsh-index: 20

Originality Incremental advance

AI Analysis

This addresses safety risks for users of fine-tuned LLMs in applications like translation and coding, highlighting a critical gap in current alignment methods.

The paper investigates how fine-tuning large language models on downstream tasks like code generation and translation degrades safety guardrails, finding that up to 92% of harmful prompts are answered in some cases, and proposes a multitask safety dataset that reduces attack success rates without harming helpfulness.

Recent breakthroughs in Large Language Models (LLMs) have led to their adoption across a wide range of tasks, ranging from code generation to machine translation and sentiment analysis, etc. Red teaming/Safety alignment efforts show that fine-tuning models on benign (non-harmful) data could compromise safety. However, it remains unclear to what extent this phenomenon is influenced by different variables, including fine-tuning task, model calibrations, etc. This paper explores the task-wise safety degradation due to fine-tuning on downstream tasks such as summarization, code generation, translation, and classification across various calibration. Our results reveal that: 1) Fine-tuning LLMs for code generation and translation leads to the highest degradation in safety guardrails. 2) LLMs generally have weaker guardrails for translation and classification, with 73-92% of harmful prompts answered, across baseline and other calibrations, falling into one of two concern categories. 3) Current solutions, including guards and safety tuning datasets, lack cross-task robustness. To address these issues, we developed a new multitask safety dataset effectively reducing attack success rates across a range of tasks without compromising the model's overall helpfulness. Our work underscores the need for generalized alignment measures to ensure safer and more robust models.

View on arXiv PDF

Similar