Preventing Catastrophic Forgetting: Behavior-Aware Sampling for Safer Language Model Fine-Tuning
This work addresses a safety concern for users of fine-tuned language models: it reduces harmful outputs while requiring very little additional training data. The contribution is incremental, since it builds on prior work that adds safety examples during fine-tuning.
The paper addresses catastrophic forgetting of safety behaviors in large language models during fine-tuning. It proposes a behavior-aware sampling framework that selects safety examples based on instruction-response behavior and semantic diversity. The reported result is up to a 41% reduction in harmfulness with only 0.5% additional training data, while maintaining helpfulness.
Large language models often lose previously aligned safety behaviors when fine-tuned on benign data, a phenomenon known as catastrophic forgetting. Prior work shows that adding random safety examples can mitigate this effect, but it remains unclear which examples are most effective. We propose a behavior-aware sampling framework that selects safety examples based on two complementary factors: instruction-response behavior (e.g., refusal versus compliance) and semantic diversity across harm categories. Systematic evaluation shows that this approach substantially reduces harmful outputs while maintaining helpfulness, achieving up to a 41% reduction in harmfulness with only 0.5% additional training data. These results highlight how targeted data selection can improve the safety and efficiency of fine-tuning at scale.
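The selection idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how behavior or semantic diversity is measured, so this example assumes each candidate safety example already carries a hypothetical `behavior` label (e.g. "refusal") and a hypothetical `category` label for its harm category, and it uses round-robin sampling across categories as a simple stand-in for semantic diversity. The function name, field names, and the default 0.5% budget parameter are all illustrative assumptions.

```python
import random
from collections import defaultdict


def select_safety_examples(pool, n_finetune, budget_frac=0.005,
                           behavior="refusal", seed=0):
    """Pick a small, category-balanced set of safety examples.

    pool: list of dicts with hypothetical keys 'behavior' and 'category'.
    n_finetune: size of the benign fine-tuning set.
    budget_frac: fraction of extra data to add (0.005 = 0.5%).
    """
    # Budget: e.g. 0.5% of a 1,000-example fine-tuning set is 5 examples.
    budget = max(1, round(n_finetune * budget_frac))
    rng = random.Random(seed)

    # Behavior filter: keep only examples showing the target behavior.
    by_category = defaultdict(list)
    for example in pool:
        if example["behavior"] == behavior:
            by_category[example["category"]].append(example)

    # Shuffle within each category so picks are random but reproducible.
    for items in by_category.values():
        rng.shuffle(items)

    # Diversity proxy (assumption): round-robin over harm categories so
    # that no single category dominates the selected set.
    categories = sorted(by_category)
    selected = []
    while len(selected) < budget and any(by_category[c] for c in categories):
        for c in categories:
            if by_category[c] and len(selected) < budget:
                selected.append(by_category[c].pop())
    return selected


# Toy candidate pool: 3 harm categories, 10 refusals and 10 compliances each.
toy_pool = [
    {"id": f"{cat}-{beh}-{i}", "behavior": beh, "category": cat}
    for cat in ("violence", "fraud", "privacy")
    for beh in ("refusal", "compliance")
    for i in range(10)
]
chosen = select_safety_examples(toy_pool, n_finetune=1000)
```

With a 1,000-example fine-tuning set, the 0.5% budget corresponds to 5 added safety examples, which the sketch spreads as evenly as possible across the three toy categories. A closer approximation of semantic diversity would likely rely on text embeddings rather than category labels, but the abstract does not say which measure is used.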