LG · AI · Mar 8, 2025

ROCM: RLHF on consistency models

arXiv:2503.06171v1 · 3 citations · h-index: 4
Originality: Incremental advance
AI Analysis

This work addresses slow generation and inefficient training in diffusion models, problems that worsen when RLHF is incorporated, by applying RLHF to consistency models instead. It represents an incremental improvement in RLHF techniques.

The paper tackles the challenge of applying Reinforcement Learning from Human Feedback (RLHF) to consistency models, which are efficient generative models, by proposing a direct reward optimization framework with distributional regularization to enhance stability and prevent reward hacking. The result is a method that achieves competitive or superior performance compared to policy gradient-based RLHF methods, as validated by empirical results across automatic metrics and human evaluation.

Diffusion models have revolutionized generative modeling in continuous domains like image, audio, and video synthesis. However, their iterative sampling process leads to slow generation and inefficient training, challenges that are further exacerbated when incorporating Reinforcement Learning from Human Feedback (RLHF) due to sparse rewards and long time horizons. Consistency models address these issues by enabling single-step or efficient multi-step generation, significantly reducing computational costs. In this work, we propose a direct reward optimization framework for applying RLHF to consistency models, incorporating distributional regularization to enhance training stability and prevent reward hacking. We investigate various $f$-divergences as regularization strategies, striking a balance between reward maximization and model consistency. Unlike policy gradient methods, our approach leverages first-order gradients, making it more efficient and less sensitive to hyperparameter tuning. Empirical results show that our method achieves competitive or superior performance compared to policy-gradient-based RLHF methods across various automatic metrics and human evaluation. Additionally, our analysis demonstrates the impact of different regularization techniques in improving model generalization and preventing overfitting.
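
To make the trade-off between reward maximization and distributional regularization concrete, here is a minimal toy sketch in Python. It is not the paper's implementation. It assumes discrete distributions with strictly positive probabilities, a regularized objective of the form "expected reward minus beta times an $f$-divergence to a reference distribution", and a particular argument order for the divergence. The names `F_GENERATORS`, `f_divergence`, `regularized_objective`, and `beta` are hypothetical and chosen for illustration only.

```python
import math

# Generator functions f for a few standard f-divergences, using the definition
#   D_f(P || Q) = sum_x q(x) * f(p(x) / q(x)),  with f convex and f(1) = 0.
# All probabilities are assumed strictly positive so the logs are defined.
F_GENERATORS = {
    # Forward KL(P || Q)
    "kl": lambda t: t * math.log(t),
    # Reverse KL, i.e. KL(Q || P)
    "reverse_kl": lambda t: -math.log(t),
    # Jensen-Shannon divergence
    "js": lambda t: 0.5 * (t * math.log(t) - (t + 1.0) * math.log((t + 1.0) / 2.0)),
    # Sum of (sqrt(p) - sqrt(q))^2, i.e. squared Hellinger up to a constant factor
    "hellinger": lambda t: (math.sqrt(t) - 1.0) ** 2,
}


def f_divergence(p, q, name="kl"):
    """Compute D_f(P || Q) for two discrete distributions given as lists."""
    f = F_GENERATORS[name]
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))


def regularized_objective(rewards, p, q_ref, beta=0.1, name="kl"):
    """Expected reward under p, penalized by beta * D_f(p || q_ref).

    This objective form is an assumption made for illustration; the source
    text does not state the exact formulation used in the paper.
    """
    expected_reward = sum(pi * ri for pi, ri in zip(p, rewards))
    return expected_reward - beta * f_divergence(p, q_ref, name)


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0]            # toy reward model: only outcome 0 is rewarded
    reference = [1 / 3, 1 / 3, 1 / 3]    # toy reference (pre-trained) distribution
    mild = [0.5, 0.25, 0.25]             # modest shift toward the rewarded outcome
    peaked = [0.98, 0.01, 0.01]          # near-collapse onto the rewarded outcome

    for name in F_GENERATORS:
        for label, p in (("mild", mild), ("peaked", peaked)):
            score = regularized_objective(rewards, p, reference, beta=0.5, name=name)
            print(f"{name:>10} {label:>6} objective = {score:.4f}")
```

In this toy setting the peaked distribution earns a higher raw reward but also pays a larger divergence penalty, which is the mechanism by which a regularizer can discourage reward hacking. How strongly each divergence penalizes the collapse depends on the generator function, which is one reason comparing several $f$-divergences is informative.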
