CL · Jul 23, 2024

TLCR: Token-Level Continuous Reward for Fine-grained Reinforcement Learning from Human Feedback

arXiv:2407.16574v2 · 42 citations · h-index: 44
AI Analysis

This work addresses a fine-grained reward problem in RLHF for language models, offering an incremental improvement over existing token-level methods.

The paper tackles the mismatch between sequence-level human preference labels and token-level generation in RLHF by introducing TLCR, which assigns continuous rewards to each token based on discriminator confidence, leading to consistent performance improvements on open-ended generation benchmarks.

Reinforcement Learning from Human Feedback (RLHF) leverages human preference data to train language models to align more closely with human essence. These human preference data, however, are labeled at the sequence level, creating a mismatch between sequence-level preference labels and tokens, which are autoregressively generated from the language model. Although several recent approaches have tried to provide token-level (i.e., dense) rewards for each individual token, these typically rely on predefined discrete reward values (e.g., positive: +1, negative: -1, neutral: 0), failing to account for varying degrees of preference inherent to each token. To address this limitation, we introduce TLCR (Token-Level Continuous Reward) for RLHF, which incorporates a discriminator trained to distinguish positive and negative tokens, and the confidence of the discriminator is used to assign continuous rewards to each token considering the context. Extensive experiments show that our proposed TLCR leads to consistent performance improvements over previous sequence-level or token-level discrete rewards on open-ended generation benchmarks.
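The contrast between discrete and continuous token rewards described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the per-token discriminator logits are assumed to be given, and both the mapping `reward = 2 * p - 1` (where `p` is the discriminator's confidence that a token is positive) and the `threshold` used for the discrete baseline are hypothetical choices made here for clarity.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def continuous_token_rewards(logits: List[float]) -> List[float]:
    """Map per-token discriminator logits to rewards in [-1, 1].

    Assumption (not taken from the paper): the confidence p = sigmoid(logit)
    is rescaled linearly, so p = 1 gives +1, p = 0.5 gives 0, p = 0 gives -1.
    """
    return [2.0 * sigmoid(logit) - 1.0 for logit in logits]


def discrete_token_rewards(logits: List[float], threshold: float = 0.8) -> List[int]:
    """Discrete baseline: +1, -1, or 0 depending on a confidence threshold.

    The threshold value is a hypothetical choice for this sketch.
    """
    rewards = []
    for logit in logits:
        p = sigmoid(logit)
        if p >= threshold:
            rewards.append(1)       # confidently positive token
        elif p <= 1.0 - threshold:
            rewards.append(-1)      # confidently negative token
        else:
            rewards.append(0)       # treated as neutral
    return rewards


if __name__ == "__main__":
    # Hypothetical discriminator logits for a four-token response.
    example_logits = [3.0, 0.2, -0.1, -2.5]
    print("discrete:  ", discrete_token_rewards(example_logits))
    print("continuous:", [round(r, 3) for r in continuous_token_rewards(example_logits)])
```

In this sketch the discrete scheme gives the two middle tokens the same reward of 0, while the continuous scheme keeps the small difference in discriminator confidence between them, which is the kind of graded preference the abstract says discrete values fail to capture.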

Code Implementations: 1 repo