CL
Dec 27, 2024

Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging

arXiv:2412.19512v3
7 citations
h-index: 17
EMNLP
Originality: Incremental advance
AI Analysis

This paper addresses a critical safety problem for users of fine-tuned LLMs, but the advance is incremental because it builds on existing model merging techniques.

The paper addresses catastrophic forgetting and the resulting safety degradation in fine-tuned large language models. It proposes merging the weights of the pre- and post-fine-tuned models, which mitigates safety degradation while improving downstream task performance, without requiring additional safety data.

Fine-tuning large language models (LLMs) for downstream tasks often leads to catastrophic forgetting, notably degrading the safety of originally aligned models. While some existing methods attempt to restore safety by incorporating additional safety data, the quality of such data typically falls short of that used in the original alignment process. Moreover, these high-quality safety datasets are generally inaccessible, making it difficult to fully recover the model's original safety. We ask: How can we preserve safety while improving downstream task performance without additional safety data? We show that simply merging the weights of pre- and post-fine-tuned models effectively mitigates safety degradation while enhancing performance. Experiments across different downstream tasks and models validate the method's practicality and effectiveness.
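The abstract says only that the pre- and post-fine-tuned weights are merged; it does not specify the merging rule. As a minimal sketch, assuming plain linear interpolation between the two checkpoints, the idea might look like the following. The function name `merge_weights`, the coefficient `lam`, and the toy parameter names are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Illustrative stand-in for a checkpoint: parameter name -> flattened values.
Weights = Dict[str, List[float]]


def merge_weights(base: Weights, tuned: Weights, lam: float = 0.5) -> Weights:
    """Return (1 - lam) * base + lam * tuned for every parameter.

    Assumption: linear interpolation. lam = 0 keeps the aligned base model,
    lam = 1 keeps the fine-tuned model.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    if base.keys() != tuned.keys():
        raise ValueError("models must share the same parameter names")

    merged: Weights = {}
    for name, base_vals in base.items():
        tuned_vals = tuned[name]
        if len(base_vals) != len(tuned_vals):
            raise ValueError(f"size mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - lam) * b + lam * t for b, t in zip(base_vals, tuned_vals)
        ]
    return merged


# Toy checkpoints (made-up numbers, for illustration only).
base = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
tuned = {"layer.weight": [3.0, 6.0], "layer.bias": [1.0]}
merged = merge_weights(base, tuned, lam=0.5)
```

In this sketch `lam` controls the trade-off: values closer to 0 stay nearer the originally aligned model, and values closer to 1 stay nearer the fine-tuned model. With real checkpoints the same per-parameter arithmetic would be applied to tensors rather than Python lists.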

