CL · AI · Dec 10, 2025

Unforgotten Safety: Preserving Safety Alignment of Large Language Models with Continual Learning

arXiv:2512.10150v1 · 10 citations · h-index: 21
Originality: Incremental advance
AI Analysis

This paper addresses the problem of preserving safety alignment when users fine-tune LLMs. It offers a practical solution, but the advance is incremental because it adapts existing continual learning (CL) methods rather than proposing new ones.

The paper tackles safety degradation in large language models during fine-tuning by framing it as a continual learning problem, showing that CL approaches like DER reduce attack success rates compared to standard fine-tuning while maintaining task utility across multiple models and tasks.
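For readers unfamiliar with the metric, attack success rate is commonly computed as the fraction of harmful prompts for which the model's response is judged harmful rather than refused. The sketch below assumes the per-prompt judgements are already available as booleans; the function name and the toy inputs are illustrative and not taken from the paper, which may use a different judging procedure.

```python
def attack_success_rate(judged_harmful):
    """Return the fraction of responses judged harmful.

    judged_harmful: list of booleans, one per harmful prompt, where True
    means the model complied (the attack succeeded) and False means the
    model refused or answered safely. How each judgement is produced
    (keyword matching, a judge model, human review) is out of scope here.
    """
    if not judged_harmful:
        raise ValueError("need at least one judged response")
    successes = sum(1 for harmful in judged_harmful if harmful)
    return successes / len(judged_harmful)


# Hypothetical judgements for four harmful prompts.
asr = attack_success_rate([True, False, False, True])
```

A lower value means the fine-tuned model retained more of its original refusal behaviour.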

The safety alignment of large language models (LLMs) is becoming increasingly important with their democratization. In this paper, we study the safety degradation that comes with adapting LLMs to new tasks. We attribute this safety compromise to catastrophic forgetting and frame the problem of preserving safety when fine-tuning as a continual learning (CL) problem. We consider the fine-tuning-as-a-service setup where the user uploads their data to a service provider to get a customized model that excels on the user's selected task. We adapt several CL approaches from the literature and systematically evaluate their ability to mitigate safety degradation. These include regularization-based, memory-based, and model merging approaches. We consider two scenarios, (1) benign user data and (2) poisoned user data. Our results demonstrate that CL approaches consistently achieve lower attack success rates than standard fine-tuning. Among these, DER outperforms both other CL methods and existing safety-preserving baselines while maintaining task utility. These findings generalize across three downstream tasks (GSM8K, SST2, Code) and three model families (LLaMA2-7B, Mistral-7B, Gemma-2B), establishing CL as a practical solution to preserve safety.
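To make the memory-based idea concrete, here is a minimal sketch of a replay objective. It assumes DER refers to Dark Experience Replay, a method that stores earlier examples together with the logits the model produced on them and penalizes drift from those logits while training on the new task. The function names, the `alpha` weight, and the toy logits are illustrative assumptions; the paper's actual adaptation to LLM fine-tuning may differ.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one example (numerically stable)."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def der_style_loss(task_logits, task_targets, replay_logits, stored_logits, alpha=0.5):
    """Task loss plus a logit-matching penalty on buffered examples.

    task_logits / task_targets: current outputs and labels on the user's data.
    replay_logits: current outputs on buffered (e.g. safety) examples.
    stored_logits: outputs recorded for those same examples before fine-tuning.
    alpha: assumed weight of the replay term.
    """
    task_term = sum(
        cross_entropy(logits, target)
        for logits, target in zip(task_logits, task_targets)
    ) / len(task_targets)

    # Mean squared difference between current and stored logits.
    squared_error, count = 0.0, 0
    for current, stored in zip(replay_logits, stored_logits):
        for c, s in zip(current, stored):
            squared_error += (c - s) ** 2
            count += 1
    replay_term = squared_error / count if count else 0.0

    return task_term + alpha * replay_term


# Toy example: one task example, one buffered example whose first logit drifted by 1.
loss = der_style_loss(
    task_logits=[[0.0, 0.0]],
    task_targets=[0],
    replay_logits=[[1.0, 2.0]],
    stored_logits=[[0.0, 2.0]],
    alpha=0.5,
)
```

In the toy example the task term is ln 2 and the replay term is 0.5, so the penalty grows as the model's outputs on the buffered examples move away from their stored values.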

