MLMay 22
Learning Kernel-Based MDPs from Episodic Preferential FeedbackNikola Pavlovic, Sattar Vakili, Qing Zhao
Human feedback often arrives as preferences rather than calibrated numeric rewards, motivating reinforcement learning from preferential feedback, also referred to as reinforcement learning from human feedback (RLHF). We present a rigorous theoretical study of preference-only learning in episodic kernel MDPs. In each episode, the learner deploys two policies from a common start state and receives a single binary label indicating which trajectory is preferred, modeled by a Bradley--Terry--Luce link on the difference of cumulative (unobserved) rewards. Under kernel-based assumptions on the reward and transition functions (one of the most general models amenable to theoretical analysis) we develop preference-based value estimation and confidence sets tailored to end-of-episode comparisons.We prove high-probability regret bounds that scale sublinearly in the number of episodes, implying that the value of the learned policy converges to that of the optimal policy.
MLJan 13, 2025
Differentially Private Kernelized Contextual BanditsNikola Pavlovic, Sudeep Salgia, Qing Zhao
We consider the problem of contextual kernel bandits with stochastic contexts, where the underlying reward function belongs to a known Reproducing Kernel Hilbert Space (RKHS). We study this problem under the additional constraint of joint differential privacy, where the agents needs to ensure that the sequence of query points is differentially private with respect to both the sequence of contexts and rewards. We propose a novel algorithm that improves upon the state of the art and achieves an error rate of $\mathcal{O}\left(\sqrt{\frac{γ_T}{T}} + \frac{γ_T}{T \varepsilon}\right)$ after $T$ queries for a large class of kernel families, where $γ_T$ represents the effective dimensionality of the kernel and $\varepsilon > 0$ is the privacy parameter. Our results are based on a novel estimator for the reward function that simultaneously enjoys high utility along with a low-sensitivity to observed rewards and contexts, which is crucial to obtain an order optimal learning performance with improved dependence on the privacy parameter.
LGFeb 20, 2024
Order-Optimal Regret in Distributed Kernel Bandits using Uniform Sampling with Shared RandomnessNikola Pavlovic, Sudeep Salgia, Qing Zhao
We consider distributed kernel bandits where $N$ agents aim to collaboratively maximize an unknown reward function that lies in a reproducing kernel Hilbert space. Each agent sequentially queries the function to obtain noisy observations at the query points. Agents can share information through a central server, with the objective of minimizing regret that is accumulating over time $T$ and aggregating over agents. We develop the first algorithm that achieves the optimal regret order (as defined by centralized learning) with a communication cost that is sublinear in both $N$ and $T$. The key features of the proposed algorithm are the uniform exploration at the local agents and shared randomness with the central server. Working together with the sparse approximation of the GP model, these two key components make it possible to preserve the learning rate of the centralized setting at a diminishing rate of communication.
MLJul 18, 2025
Differential Privacy in Kernelized Contextual Bandits via Random ProjectionsNikola Pavlovic, Sudeep Salgia, Qing Zhao
We consider the problem of contextual kernel bandits with stochastic contexts, where the underlying reward function belongs to a known Reproducing Kernel Hilbert Space. We study this problem under an additional constraint of Differential Privacy, where the agent needs to ensure that the sequence of query points is differentially private with respect to both the sequence of contexts and rewards. We propose a novel algorithm that achieves the state-of-the-art cumulative regret of $\widetilde{\mathcal{O}}(\sqrt{γ_TT}+\frac{γ_T}{\varepsilon_{\mathrm{DP}}})$ and $\widetilde{\mathcal{O}}(\sqrt{γ_TT}+\frac{γ_T\sqrt{T}}{\varepsilon_{\mathrm{DP}}})$ over a time horizon of $T$ in the joint and local models of differential privacy, respectively, where $γ_T$ is the effective dimension of the kernel and $\varepsilon_{\mathrm{DP}} > 0$ is the privacy parameter. The key ingredient of the proposed algorithm is a novel private kernel-ridge regression estimator which is based on a combination of private covariance estimation and private random projections. It offers a significantly reduced sensitivity compared to its classical counterpart while maintaining a high prediction accuracy, allowing our algorithm to achieve the state-of-the-art performance guarantees.
LGJan 6, 2025
Characterizing the Accuracy-Communication-Privacy Trade-off in Distributed Stochastic Convex OptimizationSudeep Salgia, Nikola Pavlovic, Yuejie Chi et al.
We consider the problem of differentially private stochastic convex optimization (DP-SCO) in a distributed setting with $M$ clients, where each of them has a local dataset of $N$ i.i.d. data samples from an underlying data distribution. The objective is to design an algorithm to minimize a convex population loss using a collaborative effort across $M$ clients, while ensuring the privacy of the local datasets. In this work, we investigate the accuracy-communication-privacy trade-off for this problem. We establish matching converse and achievability results using a novel lower bound and a new algorithm for distributed DP-SCO based on Vaidya's plane cutting method. Thus, our results provide a complete characterization of the accuracy-communication-privacy trade-off for DP-SCO in the distributed setting.