LG ML · Sep 23, 2025

DRO-REBEL: Distributionally Robust Relative-Reward Regression for Fast and Efficient LLM Alignment

arXiv:2509.19104v17.11 citations

Originality: Incremental advance

AI Analysis

The work addresses robustness issues in aligning large language models with human preferences, offering a scalable method with theoretical guarantees, though it builds incrementally on prior DRO-DPO approaches.

The paper tackles overoptimization in offline RLHF for LLM alignment by introducing DRO-REBEL, a family of robust relative-reward regression updates with Wasserstein, KL, and $χ^2$ ambiguity sets. It establishes $O(n^{-1/4})$ estimation bounds, sharpened to the minimax-optimal $O(n^{-1/2})$ rate through a localized analysis, and reports strong worst-case robustness on Emotion Alignment, ArmoRM, and HH-Alignment.

Reinforcement learning with human feedback (RLHF) has become crucial for aligning Large Language Models (LLMs) with human intent. However, existing offline RLHF approaches suffer from overoptimization, where models overfit to reward misspecification and drift from preferred behaviors observed during training. We introduce DRO-REBEL, a unified family of robust REBEL updates with type-$p$ Wasserstein, KL, and $χ^2$ ambiguity sets. Using Fenchel duality, each update reduces to a simple relative-reward regression, preserving scalability and avoiding PPO-style clipping or auxiliary value networks. Under standard linear-reward and log-linear policy classes with a data-coverage condition, we establish $O(n^{-1/4})$ estimation bounds with tighter constants than prior DRO-DPO approaches, and recover the minimax-optimal $O(n^{-1/2})$ rate via a localized Rademacher complexity analysis. The same analysis closes the gap for Wasserstein-DPO and KL-DPO, showing both also attain optimal parametric rates. We derive practical SGD algorithms for all three divergences: gradient regularization (Wasserstein), importance weighting (KL), and a fast 1-D dual solve ($χ^2$). Experiments on Emotion Alignment, the large-scale ArmoRM multi-objective benchmark, and HH-Alignment demonstrate strong worst-case robustness across unseen preference mixtures, model sizes, and data scales, with $χ^2$-REBEL showing consistently strong empirical performance. A controlled radius-coverage study validates a no-free-lunch trade-off: radii shrinking faster than empirical divergence concentration rates achieve minimax-optimal parametric rates but forfeit coverage, while coverage-guaranteeing radii incur $O(n^{-1/4})$ rates.
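
To make the "importance weighting (KL)" idea concrete, here is a minimal sketch, not the authors' implementation. It assumes a REBEL-style per-pair squared loss (a relative-reward regression on log-probability ratios) and uses the standard dual of a KL-constrained worst case, $\sup_{\mathrm{KL}(Q\|P_n)\le\rho} \mathbb{E}_Q[\ell] = \inf_{\lambda>0}\, \lambda\rho + \lambda \log \mathbb{E}_{P_n}[e^{\ell/\lambda}]$, whose worst-case weights are proportional to $e^{\ell/\lambda}$. The function names, the step-size symbol `eta`, the radius `rho`, and the grid search over `lam` are all illustrative assumptions; the paper's own SGD procedure may differ.

```python
import numpy as np

def rebel_losses(logratio_a, logratio_b, reward_a, reward_b, eta=1.0):
    """Per-pair squared error of a relative-reward regression (assumed form).

    logratio_* is log pi_theta(y|x) - log pi_ref(y|x) for each response;
    the regression target is the reward difference between the two responses.
    """
    pred = (np.asarray(logratio_a) - np.asarray(logratio_b)) / eta
    target = np.asarray(reward_a) - np.asarray(reward_b)
    return (pred - target) ** 2

def kl_worst_case(losses, rho, lambdas=None):
    """Approximate the KL-ball worst-case mean loss via its 1-D dual.

    Minimizes lam*rho + lam*log(mean(exp(loss/lam))) over a grid of lam
    (a hypothetical stand-in for a proper 1-D solver) and returns the dual
    value together with the normalized exponential-tilt weights.
    """
    losses = np.asarray(losses, dtype=float)
    if lambdas is None:
        lambdas = np.logspace(-3, 3, 400)
    m = losses.max()  # shift by the max for a stable log-mean-exp
    best_val, best_lam = np.inf, None
    for lam in lambdas:
        log_mean_exp = m / lam + np.log(np.mean(np.exp((losses - m) / lam)))
        val = lam * rho + lam * log_mean_exp
        if val < best_val:
            best_val, best_lam = val, lam
    weights = np.exp((losses - m) / best_lam)
    weights /= weights.sum()
    return best_val, weights

# Toy usage with made-up numbers: three preference pairs.
losses = rebel_losses([0.5, -0.2, 1.0], [0.1, 0.3, -0.5],
                      [1.0, 0.0, 2.0], [0.0, 1.0, 0.5])
value, weights = kl_worst_case(losses, rho=0.1)
```

In a training loop, the weights would multiply the per-pair losses before the gradient step, so pairs the current policy fits worst count more. The dual value always lies between the plain average loss and (up to grid resolution) the maximum loss, which is a quick sanity check on any implementation.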
