Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models
This work addresses the problem of balancing safety and helpfulness in language models, a concern for both AI developers and users. It proposes a new method for a known bottleneck rather than an incremental improvement.
The paper tackles the conflict between safety and helpfulness when fine-tuning large language models. It proposes Bi-Factorial Preference Optimization (BFPO), a supervised learning framework that re-parameterizes a joint RLHF objective over safety and helpfulness into a single supervised learning objective. The authors report that BFPO outperforms existing approaches in both safety and helpfulness, and that it matches the safety of methods that rely heavily on human labor while using less than 10% of the computational resources and human prompting and annotation effort.
Fine-tuning large language models (LLMs) on human preferences, typically through reinforcement learning from human feedback (RLHF), has proven successful in enhancing their capabilities. However, ensuring the safety of LLMs during fine-tuning remains a critical concern, and mitigating the potential conflicts between safety and helpfulness is costly in RLHF. To address this issue, we propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO), which re-parameterizes a joint RLHF objective of both safety and helpfulness into a single supervised learning objective. In this supervised optimization, a labeling function captures a global preference ranking that balances safety and helpfulness. To evaluate BFPO, we develop a benchmark that includes comprehensive discriminative and generative tasks for helpfulness and harmlessness. The results indicate that our method significantly outperforms existing approaches in both safety and helpfulness. Moreover, BFPO achieves the same level of safety as methods that heavily rely on human labor with less than 10% of the computational resources and human prompting and annotation effort. The training recipes can be found here: https://github.com/wx-zhang/bfpo.
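To make the idea of a labeling function inside a single supervised objective more concrete, here is a minimal sketch in Python. It is not the paper's exact objective, which is available in the linked repository. It assumes a squared-error loss on the policy-versus-reference log-ratio margin of a response pair, with a target margin produced by a hypothetical labeling function. The names target_margin and supervised_loss, and the parameters helpful_margin and safety_weight, are illustrative assumptions.

```python
def target_margin(safe_a: bool, safe_b: bool,
                  helpful_margin: float = 0.5,
                  safety_weight: float = 1.0) -> float:
    """Hypothetical labeling function (an assumption, not the paper's).

    Response a is the one preferred for helpfulness. The returned target
    margin grows when a is also the safer response and turns negative
    when only b is safe, so safety can override helpfulness.
    """
    return helpful_margin + safety_weight * (int(safe_a) - int(safe_b))


def supervised_loss(logp_a: float, logp_b: float,
                    ref_logp_a: float, ref_logp_b: float,
                    margin: float) -> float:
    """Squared error between the log-ratio margin and the target margin.

    logp_* are log-probabilities under the policy being trained;
    ref_logp_* are log-probabilities under a frozen reference model.
    """
    log_ratio_margin = (logp_a - ref_logp_a) - (logp_b - ref_logp_b)
    return (log_ratio_margin - margin) ** 2


# Usage: a is more helpful and safe, b is unsafe.
margin = target_margin(safe_a=True, safe_b=False)
loss = supervised_loss(logp_a=-1.0, logp_b=-2.5,
                       ref_logp_a=-2.0, ref_logp_b=-2.0,
                       margin=margin)
```

Under these assumed defaults, a pair in which only the less helpful response is safe receives a negative target, so training pushes the policy toward the safe response without a separate reward model or reinforcement learning loop.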