Offline and Online KL-Regularized RLHF under Differential Privacy
This work addresses privacy-preserving alignment of large language models for users concerned with data confidentiality, offering incremental improvements by extending existing RLHF methods with differential privacy constraints.
The paper tackles reinforcement learning from human feedback (RLHF) with KL-regularization under local differential privacy, providing algorithms for offline and online settings with theoretical guarantees. In the offline setting, it achieves a suboptimality gap of Õ(1/[(e^ε-1)² n]) and proves optimality, while in the online setting, it derives a logarithmic regret bound of O(d_F log(N_F·T)/(e^ε-1)²).
In this paper, we study the offline and online settings of reinforcement learning from human feedback (RLHF) with KL-regularization, a widely used objective in large language model alignment, under the $ε$-local differential privacy ($ε$-LDP) model applied to the labels of human preferences. In the offline setting, we design an algorithm based on the principle of pessimism and derive a new suboptimality gap of $\tilde{O}(1/[(e^ε-1)^2 n])$ on the KL-regularized objective under single-policy concentrability, where $n$ is the sample size. We also prove its optimality by providing a matching lower bound. In the online setting, we are the first to theoretically investigate KL-regularized RLHF with LDP. We design an optimism-based algorithm and derive a logarithmic regret bound of $O(d_{\mathcal{F}}\log (N_{\mathcal{F}}\cdot T) /(e^ε-1)^2 )$, where $T$ is the total number of time steps, $N_{\mathcal{F}}$ is the cardinality of the reward function space $\mathcal{F}$, and $d_{\mathcal{F}}$ is a variant of the eluder dimension for RLHF. As a by-product, our results also yield the first analysis of online KL-regularized RLHF without privacy. We implement our offline algorithm to verify our theoretical results and release our open-source code at: https://github.com/rushil-thareja/PPKL-RLHF-Official.
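To give some intuition for the $(e^ε-1)^2$ factor in both bounds, the sketch below privatizes binary preference labels with randomized response, a standard mechanism that satisfies $ε$-LDP for a single bit. The abstract does not say which mechanism the paper uses, so this choice is an assumption, and all function names here are illustrative rather than taken from the released code. Debiasing the noisy labels divides by $2p-1 = (e^ε-1)/(e^ε+1)$, so the variance grows by $((e^ε+1)/(e^ε-1))^2$, which for small $ε$ behaves like a constant times $1/(e^ε-1)^2$.

```python
import math
import random


def keep_probability(epsilon: float) -> float:
    """Probability of reporting the true label under randomized response."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def privatize_label(y: int, epsilon: float, rng: random.Random) -> int:
    """Report y in {0, 1} truthfully with prob. e^eps / (1 + e^eps), else flip it."""
    return y if rng.random() < keep_probability(epsilon) else 1 - y


def debias(z: float, epsilon: float) -> float:
    """Unbiased estimate of the true label from a privatized label z."""
    p = keep_probability(epsilon)
    # E[z | y] = p*y + (1 - p)*(1 - y), so solving for y gives the line below.
    return (z - (1.0 - p)) / (2.0 * p - 1.0)


def variance_inflation(epsilon: float) -> float:
    """Factor 1 / (2p - 1)^2 by which debiasing scales the variance of a label."""
    return ((math.exp(epsilon) + 1.0) / (math.exp(epsilon) - 1.0)) ** 2


if __name__ == "__main__":
    rng = random.Random(0)
    epsilon = 1.0
    # Synthetic preference labels: 1 means the first response was preferred.
    true_labels = [1 if rng.random() < 0.7 else 0 for _ in range(20000)]
    private_labels = [privatize_label(y, epsilon, rng) for y in true_labels]
    estimate = sum(debias(z, epsilon) for z in private_labels) / len(private_labels)
    print(f"true mean     : {sum(true_labels) / len(true_labels):.3f}")
    print(f"debiased mean : {estimate:.3f}")
    print(f"inflation     : {variance_inflation(epsilon):.2f}")
```

The debiased mean tracks the true preference rate on average, but its variance is larger by the inflation factor. That is the price of label privacy that the $1/(e^ε-1)^2$ dependence in the bounds reflects.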