LG | AI | Mar 7, 2023

Zeroth-Order Optimization Meets Human Feedback: Provable Learning via Ranking Oracles

arXiv:2303.03751v3 | 48 citations | h-index: 45
AI Analysis

This paper addresses the challenge of aligning AI systems with human intentions when only comparative feedback is available, though it appears incremental because it adapts existing zeroth-order methods to ranking oracles.

The paper tackles the problem of optimizing black-box objective functions using only ranking feedback from human judges, inspired by Reinforcement Learning with Human Feedback (RLHF). It introduces ZO-RankSGD, a zeroth-order optimization algorithm that significantly enhances image detail in diffusion models with only a few rounds of human feedback.

In this study, we delve into an emerging optimization challenge involving a black-box objective function that can only be gauged via a ranking oracle, a situation frequently encountered in real-world scenarios, especially when the function is evaluated by human judges. This challenge is inspired by Reinforcement Learning with Human Feedback (RLHF), an approach recently employed to enhance the performance of Large Language Models (LLMs) using human guidance. We introduce ZO-RankSGD, an innovative zeroth-order optimization algorithm designed to tackle this optimization problem, accompanied by theoretical assurances. Our algorithm utilizes a novel rank-based random estimator to determine the descent direction and guarantees convergence to a stationary point. Moreover, ZO-RankSGD is readily applicable to policy optimization problems in Reinforcement Learning (RL), particularly when only ranking oracles for the episode reward are available. Last but not least, we demonstrate the effectiveness of ZO-RankSGD in a novel application: improving the quality of images generated by a diffusion generative model with human ranking feedback. Across our experiments, we found that ZO-RankSGD can significantly enhance the detail of generated images with only a few rounds of human feedback. Overall, our work advances the field of zeroth-order optimization by addressing the problem of optimizing functions with only ranking feedback, and offers a new and effective approach for aligning Artificial Intelligence (AI) with human intentions.
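To make the idea of descending with only ranking feedback concrete, here is a minimal sketch in plain Python. It is not the paper's ZO-RankSGD estimator: it assumes the oracle returns a full ranking of all queried points and averages pairwise differences of the random perturbations, which is a simplification chosen for illustration. The names `ranking_oracle`, `rank_based_direction`, and `zo_rank_descent`, and the parameters `m`, `mu`, and `lr`, are hypothetical.

```python
import random


def ranking_oracle(f, points):
    """Return indices of `points` ordered from best (lowest f) to worst.

    Only the ordering is exposed to the optimizer, never the values of f.
    """
    return sorted(range(len(points)), key=lambda i: f(points[i]))


def rank_based_direction(x, oracle, m, mu, rng):
    """Estimate an ascent direction from a ranking of m random perturbations.

    Illustrative estimator (an assumption, not the paper's): average
    (xi_worse - xi_better) over all ranked pairs of perturbations.
    """
    d = len(x)
    xis = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    points = [[xj + mu * s for xj, s in zip(x, xi)] for xi in xis]
    order = oracle(points)  # best first
    g = [0.0] * d
    n_pairs = 0
    for a in range(m):
        for b in range(a + 1, m):
            better, worse = xis[order[a]], xis[order[b]]
            for j in range(d):
                g[j] += worse[j] - better[j]
            n_pairs += 1
    return [gj / n_pairs for gj in g]


def zo_rank_descent(f, x0, steps=200, m=8, mu=0.1, lr=0.1, seed=0):
    """Run descent steps using only ranking feedback about f."""
    rng = random.Random(seed)

    def oracle(pts):
        return ranking_oracle(f, pts)

    x = list(x0)
    for _ in range(steps):
        g = rank_based_direction(x, oracle, m, mu, rng)
        # Step against the estimated ascent direction.
        x = [xj - lr * gj for xj, gj in zip(x, g)]
    return x


def sphere(x):
    """Toy objective standing in for a human-judged score (lower is better)."""
    return sum(v * v for v in x)


x_final = zo_rank_descent(sphere, [3.0, -2.0, 1.5])
```

Because the estimate depends only on the ordering, its length does not shrink as the gradient shrinks, so with a fixed `lr` the iterate settles into a small neighbourhood of a stationary point instead of converging exactly. A decaying step size would be the usual remedy in a sketch like this.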

Code Implementations: 1 repo
