Towards Optimal Differentially Private Regret Bounds in Linear MDPs
This work addresses privacy concerns in reinforcement learning for personalized decision-making systems using sensitive user data, representing an incremental improvement over existing private methods.
The paper tackles regret minimization under privacy constraints in linear Markov Decision Processes by designing a new differentially private algorithm based on LSVI-UCB++, achieving a regret bound of ̃O(d √(H^3 K) + H^{15/4} d^{7/6} K^{1/2} / ε), which improves over prior private methods and retains near-optimal utility compared to non-private baselines.
We study regret minimization under privacy constraints in episodic inhomogeneous linear Markov Decision Processes (MDPs), motivated by the growing use of reinforcement learning (RL) in personalized decision-making systems that rely on sensitive user data. In this setting, both transition probabilities and reward functions are assumed to be linear in a feature mapping $φ(s, a)$, and we aim to ensure privacy through joint differential privacy (JDP), a relaxation of differential privacy suited to online learning. Prior work has established suboptimal regret bounds by privatizing the LSVI-UCB algorithm, which achieves $\widetilde{O}(\sqrt{d^3 H^4 K})$ regret in the non-private setting. Building on recent advances that improve this to near minimax optimal regret $\widetilde{O}(d\sqrt{H^{3}K})$ via LSVI-UCB++ with Bernstein-style bonuses, we design a new differentially private algorithm by privatizing LSVI-UCB++ and adapting techniques for variance-aware analysis from offline RL. Our algorithm achieves a regret bound of $\widetilde{O}(d \sqrt{H^3 K} + H^{15/4} d^{7/6} K^{1/2} / ε)$, improving over previous private methods. Empirical results show that our algorithm retains near-optimal utility compared to non-private baselines, indicating that privacy can be achieved with minimal performance degradation in this setting.