CL | LG | Dec 20, 2023

Learning and Forgetting Unsafe Examples in Large Language Models

arXiv:2312.12736v2 | 27 citations | h-index: 30 | ICML
Originality: Incremental advance
AI Analysis

This paper addresses safety risks that arise when users finetune LLMs on third-party custom data. The contribution is incremental, building on existing finetuning and safety methods.

The paper studies how large language models learn unsafe content from noisy custom finetuning data. It finds that models forget such content more readily than other examples when subsequently finetuned on safer data, and introduces the ForgetFilter algorithm, which filters unsafe data based on forgetting signals. ForgetFilter achieves a toxicity score 75% lower than applying no safety measures and 62% lower than self-correction.

As the number of large language models (LLMs) released to the public grows, there is a pressing need to understand the safety implications associated with these models learning from third-party custom finetuning data. We explore the behavior of LLMs finetuned on noisy custom data containing unsafe content, represented by datasets that contain biases, toxicity, and harmfulness, finding that while aligned LLMs can readily learn this unsafe content, they also tend to forget it more significantly than other examples when subsequently finetuned on safer content. Drawing inspiration from the discrepancies in forgetting, we introduce the "ForgetFilter" algorithm, which filters unsafe data based on how strong the model's forgetting signal is for that data. We demonstrate that the ForgetFilter algorithm ensures safety in customized finetuning without compromising downstream task performance, unlike sequential safety finetuning. ForgetFilter outperforms alternative strategies like replay and moral self-correction in curbing LLMs' ability to assimilate unsafe content during custom finetuning, e.g. 75% lower than not applying any safety measures and 62% lower than using self-correction in toxicity score.
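The filtering idea in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the forgetting signal for an example is the increase in its loss after the model is finetuned on safer data, and that examples whose signal exceeds a threshold are discarded. The names `forgetting_signal` and `forget_filter`, the threshold value, and the toy loss numbers are all hypothetical.

```python
from typing import List, Sequence, Tuple


def forgetting_signal(loss_before: Sequence[float],
                      loss_after: Sequence[float]) -> List[float]:
    """Per-example increase in loss after finetuning on safer data.

    Assumption: a larger increase means the example was forgotten more strongly.
    """
    if len(loss_before) != len(loss_after):
        raise ValueError("loss sequences must have the same length")
    return [after - before for before, after in zip(loss_before, loss_after)]


def forget_filter(examples: Sequence[str],
                  loss_before: Sequence[float],
                  loss_after: Sequence[float],
                  threshold: float) -> Tuple[List[str], List[str]]:
    """Split examples into (kept, dropped) by forgetting signal.

    Examples whose signal exceeds `threshold` are treated as likely unsafe
    and dropped; the threshold is a hypothetical hyperparameter.
    """
    if len(examples) != len(loss_before):
        raise ValueError("examples and losses must have the same length")
    signals = forgetting_signal(loss_before, loss_after)
    kept = [ex for ex, s in zip(examples, signals) if s <= threshold]
    dropped = [ex for ex, s in zip(examples, signals) if s > threshold]
    return kept, dropped


# Toy usage with made-up loss values (not results from the paper).
examples = ["benign example A", "unsafe example B", "benign example C"]
loss_before = [0.20, 0.30, 0.25]  # after finetuning on the noisy custom data
loss_after = [0.30, 1.50, 0.20]   # after further finetuning on safer data
kept, dropped = forget_filter(examples, loss_before, loss_after, threshold=0.5)
print("kept:", kept)
print("dropped:", dropped)
```

In a real pipeline the two loss lists would come from evaluating the model on each custom example before and after the safety finetuning step, and the model would then be finetuned again on the kept examples only.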

Code Implementations: 1 repo
