CL | Oct 28, 2024

Reward Modeling with Weak Supervision for Language Models

arXiv:2410.20869v1 | 1 citation | h-index: 9 | 2025 International Conference on Artificial Intelligence, Computer, Data Sciences and Applications (ACDSA)
Originality: Incremental advance
AI Analysis

This work addresses the cost and scalability of aligning language models with user intentions, though it is incremental as it builds on existing RLHF methods.

The paper tackles the problem of expensive human labeling for reward models in reinforcement learning from human feedback (RLHF) by introducing weak supervision to extend datasets, showing it significantly improves performance on smaller datasets but has diminishing returns on larger ones.

Recent advancements in large language models (LLMs) have led to their increased application across various tasks, with reinforcement learning from human feedback (RLHF) being a crucial part of their training to align responses with user intentions. In the RLHF process, a reward model is trained on response preferences determined by human labelers or AI systems, and this reward model is then used to refine the LLM through reinforcement learning. This work introduces weak supervision as a strategy to extend RLHF datasets and enhance reward model performance. Weak supervision employs noisy or imprecise data labeling, reducing reliance on expensive manually labeled data. By analyzing RLHF datasets to identify heuristics that correlate with response preference, we wrote simple labeling functions and then calibrated a label model to weakly annotate unlabeled data. Our evaluation shows that while weak supervision significantly benefits smaller datasets by improving reward model performance, its effectiveness decreases with larger, originally labeled datasets. Additionally, using an LLM to generate and then weakly label responses offers a promising method for extending preference data.
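
The labeling-function and label-model steps described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two heuristics (response length and refusal openers), the names `lf_longer_response`, `lf_refusal`, `estimate_accuracies`, and `weak_label`, and the thresholds are hypothetical and are not taken from the paper. The accuracy-weighted vote is a deliberately simple stand-in for a calibrated label model, not the authors' method.

```python
from typing import Callable, List, Tuple

# A preference pair: (prompt, response_a, response_b).
Pair = Tuple[str, str, str]

# Vote values: A preferred, B preferred, or no opinion.
PREFER_A, PREFER_B, ABSTAIN = 1, 0, -1


def lf_longer_response(pair: Pair) -> int:
    """Hypothetical heuristic: prefer the response with clearly more words."""
    _, a, b = pair
    len_a, len_b = len(a.split()), len(b.split())
    if abs(len_a - len_b) < 5:  # assumed margin; too close to call
        return ABSTAIN
    return PREFER_A if len_a > len_b else PREFER_B


def lf_refusal(pair: Pair) -> int:
    """Hypothetical heuristic: disprefer a response that opens with a refusal."""
    _, a, b = pair
    refuses_a = a.lower().startswith("i cannot")
    refuses_b = b.lower().startswith("i cannot")
    if refuses_a == refuses_b:
        return ABSTAIN
    return PREFER_B if refuses_a else PREFER_A


LabelingFunction = Callable[[Pair], int]
LABELING_FUNCTIONS: List[LabelingFunction] = [lf_longer_response, lf_refusal]


def estimate_accuracies(
    lfs: List[LabelingFunction], labeled: List[Tuple[Pair, int]]
) -> List[float]:
    """Calibrate on labeled pairs: accuracy of each function where it votes."""
    accuracies = []
    for lf in lfs:
        votes = [(lf(pair), gold) for pair, gold in labeled]
        votes = [(v, g) for v, g in votes if v != ABSTAIN]
        if not votes:
            accuracies.append(0.5)  # no evidence: treat as uninformative
        else:
            accuracies.append(sum(v == g for v, g in votes) / len(votes))
    return accuracies


def weak_label(
    pair: Pair, lfs: List[LabelingFunction], accuracies: List[float]
) -> int:
    """Combine votes, weighting each function by (accuracy - 0.5)."""
    score = 0.0
    for lf, acc in zip(lfs, accuracies):
        vote = lf(pair)
        if vote == ABSTAIN:
            continue
        weight = acc - 0.5
        score += weight if vote == PREFER_A else -weight
    if score == 0.0:
        return ABSTAIN  # no usable signal; leave the pair unlabeled
    return PREFER_A if score > 0 else PREFER_B
```

In this sketch, `estimate_accuracies` is run once on the small human-labeled set, and `weak_label` is then applied to each unlabeled pair; pairs that come back as `ABSTAIN` would be dropped rather than added to the reward model's training data.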

Code Implementations: 1 repo