LG, AI | May 23, 2025

KL-regularization Itself is Differentially Private in Bandits and RLHF

arXiv:2505.18407v2 | 1 citation | h-index: 12
Originality: Incremental advance
AI Analysis

This offers a privacy-preserving method for data-driven decision-making in sensitive applications like RLHF, though it is incremental as it builds on existing regularization techniques.

The paper tackles the problem of achieving differential privacy in decision-making algorithms without adding noise, showing that KL-regularization of stochastic policies inherently provides privacy guarantees in bandit and RLHF settings.

Differential Privacy (DP) provides a rigorous framework for privacy, ensuring the outputs of data-driven algorithms remain statistically indistinguishable across datasets that differ in a single entry. While guaranteeing DP generally requires explicitly injecting noise either to the algorithm itself or to its outputs, the intrinsic randomness of existing algorithms presents an opportunity to achieve DP "for free". In this work, we explore the role of regularization in achieving DP across three different decision-making problems: multi-armed bandits, linear contextual bandits, and reinforcement learning from human feedback (RLHF), in offline data settings. We show that adding KL-regularization to the learning objective (a common approach in optimization algorithms) makes the action sampled from the resulting stochastic policy itself differentially private. This offers a new route to privacy guarantees without additional noise injection, while also preserving the inherent advantage of regularization in enhancing performance.
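To make the "for free" claim concrete, here is a minimal sketch for the multi-armed bandit case. It is not the paper's algorithm or its bound. It only illustrates the standard fact that maximizing expected reward minus beta times the KL divergence to a reference policy yields a softmax (Gibbs) policy, and that sampling from such a policy behaves like the exponential mechanism. The names `kl_regularized_policy`, `ref_policy`, and `beta`, the uniform synthetic data, and the notion of neighbouring datasets (one reward value replaced, arm labels fixed, rewards in [0, 1]) are all assumptions made for illustration.

```python
import numpy as np

def kl_regularized_policy(mean_rewards, ref_policy, beta):
    """Closed-form maximizer of E_pi[r] - beta * KL(pi || ref_policy).

    The solution is pi(a) proportional to ref_policy(a) * exp(r(a) / beta).
    """
    logits = np.log(ref_policy) + np.asarray(mean_rewards) / beta
    logits -= logits.max()  # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def max_log_ratio(p, q):
    """Largest absolute log-probability ratio between two action distributions."""
    return float(np.max(np.abs(np.log(p) - np.log(q))))

rng = np.random.default_rng(0)
n_arms, n_per_arm, beta = 4, 50, 0.1

# Hypothetical offline dataset: n_per_arm rewards in [0, 1] for each arm.
rewards = rng.uniform(0.0, 1.0, size=(n_arms, n_per_arm))
ref_policy = np.full(n_arms, 1.0 / n_arms)  # assumed uniform reference policy

policy = kl_regularized_policy(rewards.mean(axis=1), ref_policy, beta)

# Neighbouring dataset (assumed definition): one reward of arm 0 is replaced
# by the value in {0, 1} farthest from the original.
neighbour = rewards.copy()
neighbour[0, 0] = 0.0 if rewards[0, 0] > 0.5 else 1.0
neighbour_policy = kl_regularized_policy(neighbour.mean(axis=1), ref_policy, beta)

# Replacing one reward moves an empirical mean by at most 1 / n_per_arm.
# The exponential-mechanism argument then bounds the log ratio by 2 * sensitivity / beta.
sensitivity = 1.0 / n_per_arm
eps_bound = 2.0 * sensitivity / beta
observed = max_log_ratio(policy, neighbour_policy)

# The released output is a single action sampled from the policy; no noise is added.
action = rng.choice(n_arms, p=policy)
print(f"observed log-ratio {observed:.4f} <= bound {eps_bound:.4f}")
```

In this sketch a smaller `beta` (weaker regularization) gives a larger bound, which reflects the usual trade-off between following the estimated rewards and staying close to the reference policy. The guarantees derived in the paper for linear contextual bandits and RLHF may take a different form.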

