Classifier-free guidance in LLMs Safety
The work addresses safety concerns in LLMs by enabling effective unlearning of harmful content, though it appears incremental because it builds on the existing ORPO and classifier-free guidance methods.
The paper tackles the problem of unlearning harmful content from large language models without needing a retaining dataset. It reports a significant improvement in unlearning without degrading model performance, achieved through modified classifier-free guidance (CFG) during inference and a CFG-aware training regime that uses synthetic replacement data.
The paper describes LLM unlearning without a retaining dataset, using the ORPO reinforcement learning method with inference enhanced by modified classifier-free guidance (CFG). A significant improvement in unlearning, without degradation of the model, is achieved through direct training on synthetic replacement data in a CFG-aware training regime, with classifier-free guidance applied during inference. This article is an extended version of the NeurIPS 2024 LLM-PC submission, which was awarded second prize.
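As a rough illustration of the inference-time component, the sketch below shows the standard classifier-free guidance rule applied to next-token logits. It is not the paper's modified variant, whose details are not given here. The function names `cfg_logits` and `softmax`, the toy three-token vocabulary, and the guidance scale of 2.0 are all assumptions made for illustration.

```python
import math


def cfg_logits(cond, uncond, scale):
    """Standard CFG combination (assumed form, not the paper's modification):
    guided = uncond + scale * (cond - uncond).
    scale = 1 returns the conditional logits, scale = 0 the unconditional ones.
    """
    if len(cond) != len(uncond):
        raise ValueError("logit vectors must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a 3-token vocabulary.
cond = [2.0, 1.0, 0.0]    # model output with the guiding prompt
uncond = [1.0, 1.0, 1.0]  # model output without it

guided = cfg_logits(cond, uncond, scale=2.0)
probs = softmax(guided)
```

With a scale above 1, the guided distribution moves further in the direction the conditional prompt already favors, which is why the rule can strengthen a behavior that the model was trained to show under that prompt.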