AI · Apr 15, 2024

Reinforcement Learning from Multi-role Debates as Feedback for Bias Mitigation in LLMs

arXiv:2404.10160v627 citations · h-index: 6
Originality: Incremental advance
AI Analysis

The paper targets bias in LLMs, which can harm user experience and societal outcomes, and offers a method that replaces human feedback in RLHF. The advance is incremental because it builds on existing RLHF frameworks.

The paper tackles bias mitigation in LLMs by proposing Reinforcement Learning from Multi-role Debates as Feedback (RLDF), which uses LLMs in debates to generate training data for reinforcement learning, reducing the need for human feedback and showing effectiveness across different LLMs on BBQ and custom datasets.

Bias in LLMs can harm user experience and societal outcomes. However, current bias mitigation methods often require intensive human feedback, lack transferability to other topics, or yield overconfident and random outputs. We find that involving LLMs in role-playing scenarios boosts their ability to recognize and mitigate biases. Based on this, we propose Reinforcement Learning from Multi-role Debates as Feedback (RLDF), a novel approach for bias mitigation that replaces human feedback in traditional RLHF. We use LLMs in multi-role debates to create a dataset that includes both high-bias and low-bias instances for training the reward model in reinforcement learning. Our approach comprises two modes: (1) self-reflection, where the same LLM participates in multi-role debates, and (2) teacher-student, where a more advanced LLM such as GPT-3.5-turbo guides the LLM to perform this task. Experimental results across different LLMs on BBQ and our datasets demonstrate the effectiveness of our approach in bias mitigation. Our source code and datasets are available at https://anonymous.4open.science/r/RLDF-E344.
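To make the data flow concrete, here is a minimal sketch of how debate outputs might be turned into reward-model training pairs. It is not the authors' implementation: the `DebateTurn` fields, the `bias_score` values, the 0.5 threshold, and the pairwise (Bradley-Terry style) loss common in RLHF reward modelling are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from itertools import product


@dataclass
class DebateTurn:
    role: str          # hypothetical role label played by the LLM
    text: str          # the response produced in that role
    bias_score: float  # hypothetical score in [0, 1]; higher means more biased


def build_preference_pairs(turns, threshold=0.5):
    """Pair each low-bias response (chosen) with each high-bias one (rejected).

    The threshold is an illustrative assumption, not a value from the paper.
    """
    low = [t for t in turns if t.bias_score < threshold]
    high = [t for t in turns if t.bias_score >= threshold]
    return [(chosen.text, rejected.text) for chosen, rejected in product(low, high)]


def pairwise_loss(reward_chosen, reward_rejected):
    """Bradley-Terry style loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))


# Toy debate transcript with made-up scores.
turns = [
    DebateTurn("role_a", "response A", 0.9),
    DebateTurn("role_b", "response B", 0.2),
    DebateTurn("role_c", "response C", 0.1),
]
pairs = build_preference_pairs(turns)
```

A reward model trained on such pairs is pushed to score the low-bias response above the high-bias one; the loss shrinks as that margin grows. In the self-reflection mode the turns would come from the same LLM, and in the teacher-student mode a stronger model would supply them, but the pairing step sketched here would be the same under these assumptions.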

