CL · AI · Oct 23, 2025

Preventing Catastrophic Forgetting: Behavior-Aware Sampling for Safer Language Model Fine-Tuning

arXiv:2510.21885v1 · 1 citation · h-index: 4
Originality: Incremental advance
AI Analysis

This work addresses a safety concern for users of fine-tuned language models: it reduces harmful outputs while adding very little training data. The advance is incremental, since it builds on prior work that mixes safety examples into fine-tuning data.

The paper tackles catastrophic forgetting of safety behaviors in large language models during fine-tuning. It proposes a behavior-aware sampling framework that selects safety examples based on instruction-response behavior and semantic diversity. The reported result is up to a 41% reduction in harmfulness with only 0.5% additional training data, while maintaining helpfulness.

Large language models often lose previously aligned safety behaviors when fine-tuned on benign data, a phenomenon known as catastrophic forgetting. Prior work shows that adding random safety examples can mitigate this effect, but it remains unclear which examples are most effective. We propose a behavior-aware sampling framework that selects safety examples based on two complementary factors: instruction-response behavior (e.g., refusal versus compliance) and semantic diversity across harm categories. Systematic evaluation shows that this approach substantially reduces harmful outputs while maintaining helpfulness, achieving up to a 41% reduction in harmfulness with only 0.5% additional training data. These results highlight how targeted data selection can improve the safety and efficiency of fine-tuning at scale.
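
The selection idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the field names `behavior` and `category`, the function name `select_safety_examples`, and the round-robin strategy are all assumptions made for illustration. In particular, balancing over labeled harm categories stands in here for whatever semantic-diversity measure the paper actually uses. The default budget of 0.5% of the fine-tuning set size comes from the abstract.

```python
import random
from collections import defaultdict


def select_safety_examples(pool, n_task_examples, budget_frac=0.005,
                           behavior="refusal", seed=0):
    """Pick a small, category-balanced set of safety examples.

    pool: list of dicts with hypothetical keys 'behavior' and 'category'.
    n_task_examples: size of the benign fine-tuning set.
    budget_frac: safety data as a fraction of the fine-tuning set (0.5%).
    """
    rng = random.Random(seed)
    budget = max(1, round(n_task_examples * budget_frac))

    # Behavior filter: keep only examples showing the target behavior.
    by_category = defaultdict(list)
    for ex in pool:
        if ex["behavior"] == behavior:
            by_category[ex["category"]].append(ex)

    # Shuffle within each category so picks are random but reproducible.
    for items in by_category.values():
        rng.shuffle(items)

    # Diversity (assumed proxy): round-robin over harm categories
    # until the budget is spent or the filtered pool is exhausted.
    selected = []
    categories = sorted(by_category)
    while len(selected) < budget and categories:
        for cat in list(categories):
            if len(selected) >= budget:
                break
            if by_category[cat]:
                selected.append(by_category[cat].pop())
            else:
                categories.remove(cat)
    return selected
```

With a fine-tuning set of 1,000 examples, the default budget works out to 5 safety examples, spread as evenly as possible across the harm categories present in the pool. An embedding-based diversity criterion could replace the category round-robin without changing the rest of the sketch.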

