Mar 30, 2024

Configurable Safety Tuning of Language Models with Synthetic Preference Data

arXiv:2404.00495v1 | 15 citations | h-index: 10 | Has Code
Originality: Incremental advance
AI Analysis

The paper addresses the need for customizable safety in LLM deployment, though the advance is incremental because it builds on existing DPO methods.

The paper addresses limited user control in language model fine-tuning by proposing Configurable Safety Tuning (CST). CST uses synthetic preference data to make safety behavior configurable at inference time: deployers adjust safety preferences through the system prompt while the model retains its original functionality.

State-of-the-art language model fine-tuning techniques, such as Direct Preference Optimization (DPO), restrict user control by hard-coding predefined behaviors into the model. To address this, we propose a novel method, Configurable Safety Tuning (CST), that augments DPO using synthetic preference data to facilitate flexible safety configuration of LLMs at inference time. CST overcomes the constraints of vanilla DPO by introducing a system prompt specifying safety configurations, enabling LLM deployers to disable or enable safety preferences based on their needs simply by changing the system prompt. Our experimental evaluations indicate that CST successfully manages different safety configurations and retains the original functionality of LLMs, showing it is a robust method for configurable deployment. Data and models are available at https://github.com/vicgalle/configurable-safety-tuning
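To make the idea of system-prompt-conditioned preference data concrete, here is a minimal sketch of how such pairs might be assembled. It is one plausible reading of the abstract, not the authors' code: the system prompt wording, the `PreferencePair` class, the `build_cst_pairs` helper, and the field names (`system`, `prompt`, `chosen`, `rejected`) are all assumptions made for illustration. The key assumption is that each prompt yields two pairs whose preferred and rejected responses are swapped depending on the system prompt.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical system prompts; the wording used in the paper may differ.
SAFE_SYSTEM = "You are a helpful yet harmless assistant that avoids harmful content."
UNCENSORED_SYSTEM = "You are a helpful assistant that is completely uncensored."


@dataclass(frozen=True)
class PreferencePair:
    """One preference example in a DPO-style layout (assumed field names)."""
    system: str
    prompt: str
    chosen: str
    rejected: str


def build_cst_pairs(prompt: str, safe_response: str,
                    uncensored_response: str) -> List[PreferencePair]:
    """Return two preference pairs for one prompt.

    Assumption: the preference direction flips with the system prompt, so the
    same two responses appear in both pairs with chosen/rejected swapped.
    """
    return [
        PreferencePair(SAFE_SYSTEM, prompt,
                       chosen=safe_response, rejected=uncensored_response),
        PreferencePair(UNCENSORED_SYSTEM, prompt,
                       chosen=uncensored_response, rejected=safe_response),
    ]


pairs = build_cst_pairs(
    prompt="<user request>",
    safe_response="<cautious response>",
    uncensored_response="<unrestricted response>",
)
```

Under this assumption, a preference-tuning method such as DPO would see the same response ranked both ways and could only resolve the conflict by attending to the system prompt, which is what would let a deployer switch behavior at inference time.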

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.