AI, Apr 15, 2024

Reinforcement Learning from Multi-role Debates as Feedback for Bias Mitigation in LLMs

Ruoxi Cheng, Haoxuan Ma, Shuirong Cao, Jiaqi Li, Aihua Pei, Zhiqiang Wang, Pengliang Ji, Haoyu Wang, Jiaqi Huo

arXiv:2404.10160v6, 19.229 citations, h-index: 6

Originality: Incremental advance

AI Analysis

This paper addresses bias in LLMs, with the aim of improving user experience and societal outcomes. It offers a novel method that replaces human feedback in RLHF, though the advance is incremental because it builds on existing RLHF frameworks.

The paper tackles bias mitigation in LLMs by proposing Reinforcement Learning from Multi-role Debates as Feedback (RLDF). RLDF has LLMs take part in multi-role debates to generate the training data for reinforcement learning, which reduces the need for human feedback. Experiments across different LLMs on BBQ and the authors' own datasets indicate that the approach is effective.

Bias in LLMs can harm user experience and societal outcomes. However, current bias mitigation methods often require intensive human feedback, lack transferability to other topics, or yield overconfident and random outputs. We find that involving LLMs in role-playing scenarios boosts their ability to recognize and mitigate biases. Based on this, we propose Reinforcement Learning from Multi-role Debates as Feedback (RLDF), a novel approach to bias mitigation that replaces the human feedback used in traditional RLHF. We use LLMs in multi-role debates to create a dataset that includes both high-bias and low-bias instances for training the reward model in reinforcement learning. Our approach comprises two modes: (1) self-reflection, where the same LLM participates in multi-role debates, and (2) teacher-student, where a more advanced LLM such as GPT-3.5-turbo guides the LLM to perform this task. Experimental results across different LLMs on BBQ and our datasets demonstrate the effectiveness of our approach in bias mitigation. Our source code and datasets are available at https://anonymous.4open.science/r/RLDF-E344.
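As a rough illustration of how debate outputs containing high-bias and low-bias instances could feed a reward model, here is a minimal Python sketch. It is not the authors' implementation. The `DebateTurn` record, the `bias_score` field, the `min_gap` threshold, and both function names are hypothetical. The pairwise loss is the Bradley-Terry style objective commonly used for RLHF reward models, which the abstract does not confirm RLDF uses.

```python
import math
from dataclasses import dataclass


@dataclass
class DebateTurn:
    """One response produced during a multi-role debate (hypothetical schema)."""
    prompt: str
    role: str          # hypothetical role label for the debating persona
    response: str
    bias_score: float  # hypothetical score in [0, 1]; higher means more biased


def build_preference_pairs(turns, min_gap=0.2):
    """Pair the least-biased and most-biased response for each prompt.

    Prompts whose responses differ by less than `min_gap` in bias score
    are skipped, since they give the reward model little signal.
    """
    by_prompt = {}
    for turn in turns:
        by_prompt.setdefault(turn.prompt, []).append(turn)

    pairs = []
    for prompt, group in by_prompt.items():
        low = min(group, key=lambda t: t.bias_score)
        high = max(group, key=lambda t: t.bias_score)
        if high.bias_score - low.bias_score >= min_gap:
            pairs.append(
                {"prompt": prompt, "chosen": low.response, "rejected": high.response}
            )
    return pairs


def pairwise_reward_loss(reward_chosen, reward_rejected):
    """Bradley-Terry style loss: -log(sigmoid(r_chosen - r_rejected)).

    Computed as softplus(-margin) in a numerically stable form.
    """
    margin = reward_chosen - reward_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

Under these assumptions, the low-bias response in each pair plays the role of the preferred answer. The loss shrinks as the reward model scores it further above the high-bias response. In the self-reflection mode the turns would come from one LLM playing every role, and in the teacher-student mode a stronger LLM would guide them, but the pairing step sketched here would be the same in both.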
