LGAI · May 21, 2025

A Unified Theoretical Analysis of Private and Robust Offline Alignment: from RLHF to DPO

arXiv:2505.15694v1 · 6 citations · h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses theoretical challenges in offline alignment and offers insights into the trade-off between privacy and robustness. It is rated incremental because it builds on existing frameworks.

The paper theoretically analyzes the effects of noisy labels in offline alignment. It focuses on the interplay between privacy and adversarial corruption under linear modeling assumptions, and shows that Local differential privacy-then-Corruption (LTC) is harder than Corruption-then-Local differential privacy (CTL).

In this paper, we theoretically investigate the effects of noisy labels in offline alignment, with a focus on the interplay between privacy and robustness against adversarial corruption. Specifically, under linear modeling assumptions, we present a unified analysis covering both reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) under different privacy-corruption scenarios, such as Local differential privacy-then-Corruption (LTC), where human preference labels are privatized before being corrupted by an adversary, and Corruption-then-Local differential privacy (CTL), where labels are corrupted before privacy protection. Our analysis leverages a reduction framework that reduces the offline alignment problem under linear modeling assumptions to parameter estimation in logistic regression. This framework allows us to establish an interesting separation result between LTC and CTL, demonstrating that LTC presents a greater challenge than CTL in offline alignment, even under linear models. As important by-products, our findings also advance the state-of-the-art theoretical results in offline alignment under privacy-only or corruption-only scenarios.
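The two orderings can be illustrated with a small simulation. The sketch below is hedged and non-authoritative. It assumes a linear Bradley-Terry preference model, where the label is a logistic function of ⟨θ*, z⟩ and z is a feature difference φ(x, a1) − φ(x, a0). This is what makes the reduction to logistic regression possible. It also assumes randomized response as the ε-LDP mechanism, a hypothetical adversary that overwrites an α fraction of labels with the "wrong" preference, and a standard de-biased logistic loss fitted by gradient descent. All function names, constants, and the adversary strategy are illustrative assumptions, not the paper's estimators or bounds.

```python
import numpy as np


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def make_preferences(n, theta_star, rng):
    """Assumed linear Bradley-Terry data: z = phi(x, a1) - phi(x, a0), y = 1 if a1 is preferred."""
    z = rng.standard_normal((n, theta_star.size))
    y = (rng.random(n) < sigmoid(z @ theta_star)).astype(float)
    return z, y


def randomized_response(y, eps, rng):
    """eps-LDP for a binary label: flip it with probability 1 / (1 + e^eps)."""
    flip = rng.random(y.size) < 1.0 / (1.0 + np.exp(eps))
    return np.where(flip, 1.0 - y, y)


def corrupt(y, z, theta_star, alpha):
    """Hypothetical adversary: overwrite the first alpha*n labels with the
    preference that disagrees with the true linear model."""
    y = y.copy()
    k = int(alpha * y.size)
    y[:k] = (z[:k] @ theta_star < 0).astype(float)
    return y


def fit_debiased_logistic(z, y_obs, eps, steps=500, lr=1.0):
    """Gradient descent on the logistic loss with de-biased labels.
    Under randomized response, E[y_tilde | y] = y, and the loss stays convex in theta."""
    rho = 1.0 / (1.0 + np.exp(eps))
    y_tilde = (y_obs - rho) / (1.0 - 2.0 * rho)
    theta = np.zeros(z.shape[1])
    for _ in range(steps):
        grad = z.T @ (sigmoid(z @ theta) - y_tilde) / z.shape[0]
        theta -= lr * grad
    return theta


def run(order, n=20000, eps=1.0, alpha=0.05, seed=0):
    """Return the estimation error ||theta_hat - theta_star|| for one ordering."""
    rng = np.random.default_rng(seed)
    theta_star = np.array([1.0, -0.5, 0.5])  # illustrative ground truth
    z, y = make_preferences(n, theta_star, rng)
    if order == "CTL":  # corrupt raw labels, then privatize
        y_obs = randomized_response(corrupt(y, z, theta_star, alpha), eps, rng)
    elif order == "LTC":  # privatize labels, then the adversary corrupts them
        y_obs = corrupt(randomized_response(y, eps, rng), z, theta_star, alpha)
    else:
        raise ValueError(f"unknown order: {order}")
    theta_hat = fit_debiased_logistic(z, y_obs, eps)
    return float(np.linalg.norm(theta_hat - theta_star))


if __name__ == "__main__":
    for order in ("CTL", "LTC"):
        print(order, run(order))
```

In CTL, each corrupted label still passes through randomized response, so the de-biasing step undoes the privacy noise on it in expectation. In LTC, the adversary acts on the already-privatized label, and the de-biasing factor 1/(1 − 2ρ) then amplifies its effect. This gives an informal intuition for why LTC can be harder. The printed errors depend on the seed and the assumed parameters and are not the paper's results.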
