Language Models Can Learn from Verbal Feedback Without Scalar Rewards
The work addresses the scale imbalance and loss of nuance that arise when feedback is compressed into scalar rewards for LLM training, offering AI developers and researchers a more expressive alternative. The contribution is incremental, since it builds on prior work in conditional generation.
The paper tackles the problem of training large language models (LLMs) with verbal feedback. It proposes a feedback-conditional policy (FCP) that treats feedback as a conditioning signal instead of compressing it into scalar rewards. The resulting method learns directly from response-feedback pairs and refines itself through online bootstrapping.
LLMs are often trained with RL from human or AI feedback, yet such methods typically compress nuanced feedback into scalar rewards, discarding much of its richness and inducing scale imbalance. We propose treating verbal feedback as a conditioning signal. Inspired by language priors in text-to-image generation, which enable novel outputs from unseen prompts, we introduce the feedback-conditional policy (FCP). FCP learns directly from response-feedback pairs, approximating the feedback-conditional posterior through maximum likelihood training on offline data. We further develop an online bootstrapping stage where the policy generates under positive conditions and receives fresh feedback to refine itself. This reframes feedback-driven learning as conditional generation rather than reward optimization, offering a more expressive way for LLMs to directly learn from verbal feedback. Our code is available at https://github.com/sail-sg/feedback-conditional-policy.
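To make the two stages described above concrete, here is a minimal tabular sketch in Python. It is not the authors' implementation: the real method fine-tunes an LLM, whereas this toy replaces the model with a lookup table, for which maximum likelihood reduces to counting. The function names (fit_fcp, bootstrap_round, toy_critic), the example data, and the feedback strings are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter, defaultdict


def fit_fcp(triples):
    """Tabular maximum-likelihood estimate of pi(response | prompt, feedback).

    For a lookup-table model, maximizing the likelihood of the observed
    (prompt, response, feedback) triples gives empirical conditional frequencies.
    """
    counts = defaultdict(Counter)
    for prompt, response, feedback in triples:
        counts[(prompt, feedback)][response] += 1
    policy = {}
    for condition, counter in counts.items():
        total = sum(counter.values())
        policy[condition] = {resp: n / total for resp, n in counter.items()}
    return policy


def bootstrap_round(policy, triples, prompt, goal, critic, n_samples, rng):
    """One online round: sample under the positive condition `goal`,
    collect fresh feedback from `critic`, and refit on the enlarged data."""
    dist = policy[(prompt, goal)]
    responses = rng.choices(list(dist), weights=list(dist.values()), k=n_samples)
    fresh = [(prompt, resp, critic(prompt, resp)) for resp in responses]
    data = triples + fresh
    return fit_fcp(data), data


def toy_critic(prompt, response):
    """Hypothetical stand-in for a human or AI feedback provider."""
    return "correct and concise" if response == "4" else "correct but verbose"


# Hypothetical offline data: (prompt, response, verbal feedback).
offline = [
    ("2+2?", "4", "correct and concise"),
    ("2+2?", "4", "correct and concise"),
    ("2+2?", "four", "correct and concise"),
    ("2+2?", "The answer is 4, because ...", "correct but verbose"),
    ("2+2?", "5", "incorrect"),
]

GOAL = "correct and concise"

# Offline stage: condition on the desired feedback at inference time.
offline_policy = fit_fcp(offline)
before = offline_policy[("2+2?", GOAL)]

# Online stage: generate under the positive condition, get fresh feedback, refit.
rng = random.Random(0)
online_policy, data = bootstrap_round(
    offline_policy, offline, "2+2?", GOAL, toy_critic, n_samples=20, rng=rng
)
after = online_policy[("2+2?", GOAL)]

print("before:", before)
print("after: ", after)
```

In this toy, the feedback string acts purely as a conditioning key, so no scalar reward is ever computed. After the online round, responses that the critic no longer labels "correct and concise" stop accumulating mass under that condition, which is the self-refinement effect the abstract describes, under the simplifying assumptions stated above.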